RMSProp

Exponentially Decaying Average

The idea of RMSProp is similar to what you saw in the first step of AdaDelta: it maintains an exponentially decaying average of the squared gradients.

$$
\text{Let } g = \frac{\partial E}{\partial W} \text{ at training step } t\\
\underbrace{E[g^2]_t}_{\text{average of squared gradient}} = \gamma E[g^2]_{t-1} + (1-\gamma){g_t}^2\\
W \leftarrow W - \dfrac{\eta}{\sqrt{\epsilon + E[g^2]_t}}g_t
$$

A good suggested value for $\gamma$ is 0.9, and for $\eta$ it is 0.001. $\epsilon$ plays the same role as in AdaGrad: a small constant that keeps the denominator from becoming zero.
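
To make the update concrete, here is a minimal NumPy sketch of the two equations above. The function name `rmsprop_update` and the toy loss $E(W) = \lVert W \rVert^2$ are just for illustration and are not from the original post.

```python
import numpy as np

def rmsprop_update(W, grad, avg_sq_grad, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp step: update the decaying average of squared gradients,
    then divide the learning rate by its square root."""
    # E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2
    avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * grad ** 2
    # W <- W - eta / sqrt(eps + E[g^2]_t) * g_t
    W = W - lr * grad / np.sqrt(eps + avg_sq_grad)
    return W, avg_sq_grad

# Usage sketch: the running average is kept across steps,
# initialized to zeros with the same shape as W.
W = np.random.randn(3)
avg_sq_grad = np.zeros_like(W)
for step in range(100):
    grad = 2 * W  # gradient of the toy loss E(W) = ||W||^2 (assumed for illustration)
    W, avg_sq_grad = rmsprop_update(W, grad, avg_sq_grad)
```

Note that parameters with consistently large gradients accumulate a large $E[g^2]$ and get smaller effective steps, while rarely-updated parameters keep a larger effective learning rate.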

In the next post let’s look at “Adam” (Adaptive Moment Estimation), which builds on RMSProp and momentum. It is considered a default choice in many deep learning settings. Enjoy the end-of-post comic.

XKCD Comic