Nadam

Momentum and Nesterov Accelerated Gradient

We know that momentum updates the weights in the following way, where $\epsilon$ is the learning rate and $\beta$ the momentum coefficient:

$$
g_t \leftarrow \dfrac{\partial E}{\partial W_t}\\
V_t \leftarrow \beta V_{t-1} + \epsilon g_t\\
W_{t+1} \leftarrow W_t - V_t
$$

Writing the update in one step gives us

$$
W_{t+1} \leftarrow W_{t} - (\beta V_{t-1} + \epsilon g_t)
$$
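The momentum recursion above can be sketched in a few lines of Python (the function name and the concrete values are illustrative; `eps` is the learning rate $\epsilon$ from the equations):

```python
def momentum_step(W, V_prev, grad, beta=0.9, eps=0.1):
    """One momentum step in the text's notation (eps is the learning rate)."""
    V = beta * V_prev + eps * grad   # V_t <- beta * V_{t-1} + eps * g_t
    W_next = W - V                   # W_{t+1} <- W_t - V_t
    return W_next, V

# One step from W = 1.0 with zero initial velocity and gradient 2.0:
W, V = momentum_step(1.0, 0.0, 2.0)
# V = 0.9*0 + 0.1*2 = 0.2, so W moves from 1.0 to 0.8
```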

Nesterov’s accelerated gradient (NAG) modifies momentum in the following way

$$
W_{t} \leftarrow W_{t-1} - \beta V_{t-1}\\
g_t \leftarrow \dfrac{\partial E}{\partial W_{t}}\\
V_{t} \leftarrow \beta V_{t-1} + \epsilon g_t\\
W_{t+1} \leftarrow W_t - V_t
$$

Instead of applying two updates to $W$, we can derive a single-update variant of NAG by modifying the final step above: instead of subtracting $\beta V_{t-1} + \epsilon g_t$ from the final $W$, we subtract $\beta V_{t} + \epsilon g_t$, which produces the same look-ahead correction NAG intends. So the NAG update in a single step becomes

$$
W_{t+1} \leftarrow W_{t} - (\beta V_{t} + \epsilon g_t)
$$

Notice that the only difference between the single-step momentum and NAG updates is whether the final update uses $V_{t-1}$ or $V_t$.
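A minimal sketch of the single-update NAG variant, for comparison with the momentum step (names and values are illustrative):

```python
def nag_single_step(W, V_prev, grad, beta=0.9, eps=0.1):
    """Single-update NAG: subtract beta*V_t + eps*g_t instead of V_t."""
    V = beta * V_prev + eps * grad        # V_t <- beta * V_{t-1} + eps * g_t
    W_next = W - (beta * V + eps * grad)  # uses V_t, where momentum uses V_{t-1}
    return W_next, V

# From W = 1.0, zero velocity, gradient 1.0:
W, V = nag_single_step(1.0, 0.0, 1.0)
# V = 0.1, so NAG subtracts 0.9*0.1 + 0.1 = 0.19 (momentum would subtract 0.1)
```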

Adam’s final step

Adam’s update is as follows (note the change of notation: here $\eta$ is the learning rate and $\epsilon$ is a small stability constant)

$$
m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)g_t\\
v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)g_t^{2}\\
\hat{m}_t \leftarrow \dfrac{m_t}{1-\beta_1^t}\\
\hat{v}_t \leftarrow \dfrac{v_t}{1-\beta_2^t}\\
W_t \leftarrow W_{t-1} - \eta \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$
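These five steps translate directly into code; this is a sketch with the usual default hyperparameters, not a production implementation:

```python
import math

def adam_step(W, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)             # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    W = W - eta * m_hat / (math.sqrt(v_hat) + eps)
    return W, m, v

# At t = 1 with m = v = 0, m_hat = g and v_hat = g^2,
# so the first step has magnitude close to eta regardless of grad's scale:
W, m, v = adam_step(0.0, 0.0, 0.0, 2.0, t=1)
```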

Rewriting the final update step, we get

$$
W_t \leftarrow W_{t-1} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\bigg(\dfrac{\beta_1 m_{t-1}}{1-\beta_1^t} + \dfrac{(1 - \beta_1)g_t}{1-\beta_1^t}\bigg)
$$
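This rewriting is exact, which a quick numeric check confirms (the concrete values below are arbitrary):

```python
import math

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m_prev, v_prev, g, t = 0.5, 0.04, -1.5, 7  # arbitrary illustrative state

m = beta1 * m_prev + (1 - beta1) * g
v = beta2 * v_prev + (1 - beta2) * g ** 2
v_hat = v / (1 - beta2 ** t)

# Standard Adam step: eta * m_hat / (sqrt(v_hat) + eps)
step_adam = eta * (m / (1 - beta1 ** t)) / (math.sqrt(v_hat) + eps)

# Rewritten step: bias correction distributed over both terms of m_t
step_rewritten = eta / (math.sqrt(v_hat) + eps) * (
    beta1 * m_prev / (1 - beta1 ** t) + (1 - beta1) * g / (1 - beta1 ** t)
)
```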

In this equation we can approximate $\dfrac{m_{t-1}}{1-\beta_1^t}$ with $\hat{m}_{t-1}$. It is only an approximation because $\hat{m}_{t-1}$ is actually $\dfrac{m_{t-1}}{1-\beta_1^{t-1}}$; the two differ by the factor $\dfrac{1-\beta_1^{t-1}}{1-\beta_1^t}$, which tends to $1$ as $t$ grows. With this approximation, the final Adam step becomes

$$
W_t \leftarrow W_{t-1} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\bigg(\beta_1 \hat{m}_{t-1} + \dfrac{(1 - \beta_1)g_t}{1-\beta_1^t}\bigg)
$$
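The only thing approximated here is the denominator: $1-\beta_1^{t-1}$ in place of $1-\beta_1^t$. A quick check of their ratio shows the approximation is crude for small $t$ but rapidly approaches exactness:

```python
beta1 = 0.9

# Ratio of the exact denominator (1 - beta1**t) to the approximate one
# (1 - beta1**(t - 1)); a ratio of 1 means the approximation is exact.
ratios = {t: (1 - beta1 ** t) / (1 - beta1 ** (t - 1)) for t in (2, 10, 100)}
# t=2 -> 1.9 (crude early on), t=10 -> ~1.06, t=100 -> ~1.000003
```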

Nesterov + Adam

Note that Adam’s final step now resembles the single-step momentum update, with $\beta_1 \hat{m}_{t-1}$ playing the role of $\beta V_{t-1}$. Introducing Nesterov’s correction, replacing $\hat{m}_{t-1}$ with $\hat{m}_t$ just as NAG replaces $V_{t-1}$ with $V_t$, gives the final Nadam update

$$
W_{t+1} \leftarrow W_t - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\bigg(\beta_1\hat{m}_t + \dfrac{1 - \beta_1}{1 - \beta_1^t}g_t\bigg)
$$
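Putting it together, one Nadam step can be sketched as follows (an illustrative scalar implementation of the update above, not a reference one):

```python
import math

def nadam_step(W, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nadam step: Adam with the Nesterov-style m_hat_t swap."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov correction: beta1 * m_hat_t plus the bias-corrected current gradient
    W = W - eta / (math.sqrt(v_hat) + eps) * (
        beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    )
    return W, m, v

# First step from W = 0 with grad = 2: m_hat = 2, v_hat = 4, so the update is
# 0.001/2 * (0.9*2 + 0.1*2/0.1) = 0.0005 * 3.8 = 0.0019 (vs 0.001 for plain Adam)
W, m, v = nadam_step(0.0, 0.0, 0.0, 2.0, t=1)
```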