Recognizing Important Features
The idea of AdaGrad is a simple one: normalize each gradient update by the square root of the sum of squared gradients seen so far. Suppose we have run training for t steps; then the weight update at step t+1 is as follows:
$$
W_i^{(t+1)} \leftarrow W_i^{(t)} - \frac{\alpha}{\sqrt{\epsilon + \sum_{\tau = 1}^{t}\left(\frac{\partial E}{\partial W_i^{(\tau)}}\right)^2}}\,\frac{\partial E}{\partial W_i^{(t)}}
$$
This has a very nice consequence. Weights that frequently receive large gradients get a reduced update, because the accumulated sum of squared gradients in the denominator is large. Weights that receive small or infrequent gradients get relatively larger updates, because their accumulated sum stays small. $\epsilon$ is just a safeguard against division by zero; its value is on the order of $10^{-8}$.
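To make the update concrete, here is a minimal NumPy sketch of the rule above. The function name `adagrad_update` and the toy gradients are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def adagrad_update(w, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    """One AdaGrad step: scale the step per weight by the root of the
    accumulated squared gradients (matching the equation above)."""
    grad_sq_sum += grad ** 2                            # running sum of squared gradients
    w -= lr * grad / np.sqrt(eps + grad_sq_sum)         # per-weight adaptive step
    return w, grad_sq_sum

# Toy usage: weight 0 gets a large gradient every step,
# weight 1 gets a small gradient only occasionally (a "rare feature").
w = np.zeros(2)
acc = np.zeros(2)
for step in range(100):
    grad = np.array([1.0, 0.1 if step % 10 == 0 else 0.0])  # hypothetical gradients
    w, acc = adagrad_update(w, grad, acc)
```

Running this, the rarely updated weight ends up taking proportionally larger steps whenever its gradient does appear, which is exactly the "preference for rare features" described above.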
Potential Drawbacks
AdaGrad adapts the learning rate per weight and gives more weight to rare features. But this comes at the cost of the learning rate eventually becoming too small, because the sum of squared gradients in the denominator only ever grows. At some point in training, the updates become vanishingly small and the weights effectively stop moving.
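A quick sketch of this drawback, assuming (for illustration only) a constant gradient of 1 for a single weight: the effective step size decays like $\alpha / \sqrt{t}$ and never recovers.

```python
import numpy as np

lr, eps = 0.1, 1e-8
for t in [1, 10, 100, 1000, 10000]:
    acc = t * 1.0 ** 2                    # accumulated squared gradients after t steps
    print(t, lr / np.sqrt(eps + acc))     # effective step size keeps shrinking
```

The printed step size drops from 0.1 to 0.001 over these steps, so even if the loss surface still calls for large moves, AdaGrad can no longer make them.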
Let’s look at some algorithms that try to fix this flaw in AdaGrad, starting with “Adadelta”. Enjoy the end-of-post comic.