Momentum and Nesterov Accelerated Gradient
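Momentum updates the weights $W$ as follows, where $E$ is the loss, $\beta$ the momentum coefficient, $\epsilon$ the learning rate, and $V$ the accumulated velocity: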
$$
g_t \leftarrow \dfrac{\partial E}{\partial W_t}\\
V \leftarrow \beta V + \epsilon g_t\\
W \leftarrow W - V\\
$$
Writing the update in one step gives us
$$
W_{t+1} \leftarrow W_{t} - (\beta V_{t-1} + \epsilon g_t)
$$
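As a rough illustration, the momentum update above can be sketched in NumPy. The toy loss $E(W) = \frac{1}{2}\lVert W \rVert^2$, the names `momentum_step` and `grad_fn`, and the values of `beta` and `eps` are assumptions chosen for this example, not part of the derivation.

```python
import numpy as np

def grad_fn(W):
    # Gradient of the assumed toy loss E(W) = 0.5 * ||W||^2.
    return W

def momentum_step(W, V, beta=0.9, eps=0.1):
    """One momentum step: V <- beta * V + eps * g, then W <- W - V."""
    g = grad_fn(W)
    V_new = beta * V + eps * g
    W_new = W - V_new
    return W_new, V_new

W0 = np.array([1.0, -2.0])
V0 = np.zeros_like(W0)

# First step, compared against the one-step form W - (beta * V + eps * g).
W1, V1 = momentum_step(W0, V0)
one_step = W0 - (0.9 * V0 + 0.1 * grad_fn(W0))

# Iterate further to see the weights approach the minimum at the origin.
W, V = W1, V1
for _ in range(200):
    W, V = momentum_step(W, V)
print(np.linalg.norm(W))
```

The two-line form and the one-step form give the same weights; the one-step form simply substitutes the new velocity into the weight update.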
Nesterov’s accelerated gradient (NAG) changes momentum in the following way
$$
W_{t} \leftarrow W_{t-1} - \beta V_{t-1}\\
g_t \leftarrow \dfrac{\partial E}{\partial W_{t}}\\
V_{t} \leftarrow \beta V_{t-1} + \epsilon g_t\\
W_{t+1} \leftarrow W_t - V_t
$$
Instead of applying two updates to $W$, we can fold them into a single-update variant of NAG by modifying the final step. Rather than subtracting $\beta V_{t-1} + \epsilon g_t$ from $W$, as momentum does, we subtract $\beta V_{t} + \epsilon g_t$, which produces the same look-ahead correction that NAG intends. The NAG update in a single step becomes
$$
W_{t+1} \leftarrow W_{t} - (\beta V_{t} + \epsilon g_t)
$$
Notice that the two updates differ only in the velocity used in the final step: momentum uses $V_{t-1}$, whereas NAG uses $V_t$.
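The following sketch checks numerically that the single-step update tracks the look-ahead point of a two-update NAG. It assumes the usual look-ahead formulation (gradient evaluated at $\theta - \beta V$, where $\theta$ is a hypothetical name for the un-shifted weights) and a toy quadratic loss; `BETA`, `EPS`, `CURVATURE`, and the function names are illustrative choices.

```python
import numpy as np

BETA, EPS = 0.9, 0.05
CURVATURE = np.array([1.0, 10.0])  # assumed toy loss: 0.5 * sum(c * W**2)

def grad_fn(W):
    return CURVATURE * W

def nag_single_step(W, V):
    """Single-update NAG: W <- W - (beta * V_t + eps * g_t)."""
    g = grad_fn(W)
    V_new = BETA * V + EPS * g
    return W - (BETA * V_new + EPS * g), V_new

def nag_lookahead_step(theta, V):
    """Two-update NAG: gradient at the look-ahead point, step from theta."""
    g = grad_fn(theta - BETA * V)
    V_new = BETA * V + EPS * g
    return theta - V_new, V_new

theta = np.array([1.0, -2.0])
V_two = np.zeros_like(theta)
W = theta - BETA * V_two          # single-step weights start at the look-ahead point
V_one = np.zeros_like(theta)

max_gap = 0.0
for _ in range(50):
    theta, V_two = nag_lookahead_step(theta, V_two)
    W, V_one = nag_single_step(W, V_one)
    # W should coincide with the next look-ahead point theta - beta * V.
    gap = np.max(np.abs(W - (theta - BETA * V_two)))
    max_gap = max(max_gap, gap)
print(max_gap)
```

Under these assumptions the gap stays at floating-point round-off, so the single-step variant simply stores the look-ahead weights instead of the un-shifted ones.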
Adam’s final step
Adam’s update is as follows
$$
m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)g_t\\
v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)g_t^{2}\\
\hat{m}_t \leftarrow \dfrac{m_t}{1-\beta_1^t}\\
\hat{v}_t \leftarrow \dfrac{v_t}{1-\beta_2^t}\\
W_t \leftarrow W_{t-1} - \eta \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$
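A minimal NumPy sketch of one Adam step, following the five equations above line by line. The function name `adam_step`, the default hyperparameters, and the example gradient are assumptions made for illustration.

```python
import numpy as np

def adam_step(W, m, v, g, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)             # bias-corrected first moment
    v_hat = v / (1 - beta2**t)             # bias-corrected second moment
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

W0 = np.array([1.0, -2.0])
g1 = np.array([1.0, -2.0])                 # example gradient (assumed)
W1, m1, v1 = adam_step(W0, np.zeros(2), np.zeros(2), g1, t=1)
print(W1)
```

At $t = 1$ with zero-initialized moments, the bias correction gives $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each coordinate moves by roughly $\eta$ in the direction opposite to its gradient sign.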
Rewriting the final update step, we get
$$
W_t \leftarrow W_{t-1} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\bigg(\dfrac{\beta_1 m_{t-1}}{1-\beta_1^t} + \dfrac{(1 - \beta_1)g_t}{1-\beta_1^t}\bigg)
$$
In this equation we can approximate $\dfrac{m_{t-1}}{1-\beta_1^t}$ with $\hat{m}_{t-1}$. This is only an approximation because $\hat{m}_{t-1}$ is actually $\dfrac{m_{t-1}}{1-\beta_1^{t-1}}$; the two differ by the factor $\dfrac{1-\beta_1^{t-1}}{1-\beta_1^{t}}$, which approaches 1 as $t$ grows. With this approximation, we get the final Adam step as
$$
W_t \leftarrow W_{t-1} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\bigg(\beta_1 \hat{m}_{t-1} + \dfrac{(1 - \beta_1)g_t}{1-\beta_1^t}\bigg)
$$
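To get a feel for how rough the approximation is, the sketch below prints the ratio between the exact term $\dfrac{m_{t-1}}{1-\beta_1^t}$ and its replacement $\hat{m}_{t-1}$, which is $\dfrac{1-\beta_1^{t-1}}{1-\beta_1^{t}}$. The value $\beta_1 = 0.9$ and the helper name `correction_ratio` are assumptions for the example; the ratio is only defined from $t = 2$ onward because $\hat{m}_0$ is not.

```python
beta1 = 0.9

def correction_ratio(t):
    """Exact term divided by the approximation m_hat_{t-1}, for t >= 2."""
    return (1 - beta1 ** (t - 1)) / (1 - beta1 ** t)

for t in (2, 5, 10, 50, 100):
    print(t, correction_ratio(t))
```

The ratio is well below 1 in the first few iterations and moves toward 1 afterwards, so the approximation mostly affects early steps.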
Nesterov + Adam
Note that Adam’s final step resembles the momentum update, with $\hat{m}_{t-1}$ playing the role of $V_{t-1}$. Introducing Nesterov’s correction into the final step of Adam, that is, replacing $\hat{m}_{t-1}$ with $\hat{m}_t$, gives us the final update as
$$
W_t \leftarrow W_{t-1} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\bigg(\beta_1\hat{m}_t + \dfrac{1 - \beta_1}{1 - \beta_1^t}g_t\bigg)
$$
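The combined update can be sketched in NumPy as below. This follows the final equation of this section only; the name `nadam_step`, the default hyperparameters, and the example gradient are assumptions, and library implementations may differ in details such as momentum schedules.

```python
import numpy as np

def nadam_step(W, m, v, g, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nesterov-corrected Adam step at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Nesterov correction: use m_hat_t where plain Adam has m_hat_{t-1}.
    update = beta1 * m_hat + (1 - beta1) / (1 - beta1**t) * g
    W = W - eta * update / (np.sqrt(v_hat) + eps)
    return W, m, v

W0 = np.array([1.0, -2.0])
g1 = np.array([1.0, -2.0])                 # example gradient (assumed)
W1, m1, v1 = nadam_step(W0, np.zeros(2), np.zeros(2), g1, t=1)
print(W1)
```

At $t = 1$ with zero-initialized moments, the bracketed term equals $(\beta_1 + 1)\,g_1$, so the first step is about $1.9\eta$ per coordinate, compared with about $\eta$ for plain Adam.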