Nesterov Accelerated Gradient

Correction Factor

While using momentum, we first updated $V$ and then used it in the weight update. The idea of Nesterov Accelerated Gradient (NAG) is to first shift the current weights by a small part of $V$ and then calculate the gradient at this shifted (look-ahead) position instead of at the current position. The small shift $\alpha V$ applied to the current weights is called the correction factor. Simply put, we change the position at which the gradient is calculated. So for each iteration of gradient descent, the updates are as follows

$$
\tilde{W} \leftarrow W - \alpha V\\
V \leftarrow \alpha V + \epsilon \frac{\partial E}{\partial \tilde{W}}\\
W \leftarrow W - V
$$

Here $\alpha$ plays the same role as $\beta$ in the standard momentum algorithm (it controls how much of the previous velocity is kept), $\epsilon$ is the learning rate, and $\tilde{W}$ is the look-ahead point at which the gradient is evaluated; the final update is still applied to the original $W$.
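
To make these three steps concrete, here is a minimal NumPy sketch of a single NAG update on a toy quadratic loss. The names `nag_step` and `grad_E`, the toy loss, and the hyperparameter values are illustrative choices, not taken from this post or any library.

```python
import numpy as np

def grad_E(W):
    """Gradient of the toy loss E(W) = 0.5 * ||W||^2 (illustrative only)."""
    return W

def nag_step(W, V, grad_fn, alpha=0.9, eps=0.1):
    """One Nesterov Accelerated Gradient update, following the three steps above."""
    W_lookahead = W - alpha * V                   # shift by the correction factor alpha * V
    V = alpha * V + eps * grad_fn(W_lookahead)    # gradient measured at the look-ahead point
    W = W - V                                     # final update is applied to the original W
    return W, V

# A few iterations on the toy loss:
W, V = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(20):
    W, V = nag_step(W, V, grad_E)
print(W)  # W moves toward the minimum at the origin
```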

Comparison with Momentum

The illustrations below show the difference between momentum and Nesterov Accelerated Gradient.

momentum illustration

Nesterov Accelerated Gradient illustration

Though this seems like a trivial change, it usually makes the velocity respond to changes in the gradient more quickly. This leads to better stability (fewer oscillations) than standard momentum and works better with a high $\alpha$ value.
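
As a rough illustration of that claim, the sketch below runs plain momentum and NAG side by side on the same toy quadratic loss and prints how far each ends up from the minimum. The momentum update written here ($V \leftarrow \alpha V + \epsilon \frac{\partial E}{\partial W}$, $W \leftarrow W - V$) is one common convention and may not match the previous post's notation exactly; all names and values are illustrative.

```python
import numpy as np

def grad_E(W):
    """Gradient of the toy loss E(W) = 0.5 * ||W||^2 (illustrative only)."""
    return W

def momentum_step(W, V, alpha, eps):
    V = alpha * V + eps * grad_E(W)               # gradient at the current point
    return W - V, V

def nesterov_step(W, V, alpha, eps):
    V = alpha * V + eps * grad_E(W - alpha * V)   # gradient at the look-ahead point
    return W - V, V

def run(step_fn, steps=50, alpha=0.9, eps=0.1):
    W, V = np.array([5.0, -3.0]), np.zeros(2)
    for _ in range(steps):
        W, V = step_fn(W, V, alpha, eps)
    return np.linalg.norm(W)                      # distance from the minimum after `steps` updates

print("momentum:", run(momentum_step))
print("nesterov:", run(nesterov_step))            # typically smaller: the velocity corrects sooner
```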

In the next post we'll look at AdaGrad. Enjoy the end-of-post comic.
XKCD Comic