All Matters AI



Vectorization

Posted on 2018-07-17 | In Practical Deep Learning | Reading time ≈ 0:02

Disclaimer

Until the last post, we've discussed the basic theory of neural networks. From this post on, let's move to some practice. You might have seen titles of the form "create your own deep learning thingy in less than 20 or so lines of code". While such a title does deliver what it says, under those 20 lines of code sit hundreds or even thousands of lines abstracted away for convenience. Starting with this post, you'll learn to code up a mini deep learning library on your own. As I've mentioned earlier, this is not to re-invent the wheel or add yet another deep learning library to the pool; it's to deepen your understanding of what's going on under the hood of a deep learning library. We'll be using Python and the NumPy module.
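
As a small taste of why vectorization matters, here's a minimal sketch (the example and array sizes are my own, not from the post) comparing an explicit Python loop with the equivalent NumPy call:

```python
import numpy as np

def dot_loop(a, b):
    """Dot product with an explicit Python loop."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_vectorized(a, b):
    """Same dot product, vectorized with NumPy."""
    return np.dot(a, b)

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
# Both give the same result; the vectorized version runs orders of magnitude faster.
assert np.isclose(dot_loop(a, b), dot_vectorized(a, b))
```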

Read more »

Optimizing Algorithms - Recap and Resources

Posted on 2018-07-04 | In AI, Deep Learning Fundamentals | Reading time ≈ 0:01

So far we've seen the following optimizations for gradient descent.

Read more »

Nadam

Posted on 2018-07-02 | Post modified: 2018-07-04 | In AI, Deep Learning Fundamentals | Reading time ≈ 0:02

Momentum and Nesterov Accelerated Gradient

We know that momentum updates the weights in the following way

$$
g_t \leftarrow \dfrac{\partial E}{\partial W_t}\\
V \leftarrow \beta V + \epsilon g_t\\
W \leftarrow W - V\\
$$

Writing the update in one step gives us

$$
W_{t+1} \leftarrow W_{t} - (\beta V_{t-1} + \epsilon g_t)
$$
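
Below is a minimal NumPy sketch of this momentum update, in the post's notation ($\beta$ for the momentum term, $\epsilon$ for the learning rate); the function name, toy loss, and default values are my own assumptions:

```python
import numpy as np

def momentum_step(W, V, grad, beta=0.9, eps=0.01):
    """One momentum update: V <- beta * V + eps * g, then W <- W - V."""
    V = beta * V + eps * grad
    W = W - V
    return W, V

# Toy usage on E(W) = 0.5 * ||W||^2, whose gradient is simply W.
W = np.array([1.0, -2.0])
V = np.zeros_like(W)
for _ in range(100):
    W, V = momentum_step(W, V, grad=W)
# W is now close to the minimum at the origin.
```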

Read more »

AdaMax

Posted on 2018-06-30 | Post modified: 2018-06-30 | In AI, Deep Learning Fundamentals | Reading time ≈ 0:01

A quick tweak to Adam

AdaMax is a variant of Adam. It normalizes $\hat{m_t}$ using the infinity norm of the current and past gradients instead of the $L^2$ norm used in Adam. So the new steps for updating the weights are as follows
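
For reference, here's a minimal NumPy sketch of a standard AdaMax-style step; the function name, default hyperparameters, and the explicit bias correction shown below are my own assumptions rather than the post's exact formulation:

```python
import numpy as np

def adamax_step(W, m, u, grad, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax update: Adam's L2-norm term is replaced by an infinity norm."""
    m = beta1 * m + (1 - beta1) * grad        # first moment, as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))   # infinity-norm-based second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction (m starts at zero)
    W = W - lr * m_hat / (u + eps)
    return W, m, u
```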

Read more »

Adam

Posted on 2018-06-28 | Post modified: 2018-07-03 | In AI, Deep Learning Fundamentals | Reading time ≈ 0:02

First and Second Moments

In contrast to what we've seen so far, Adam relies on two variables to adapt the learning rate: $m_t$ and $v_t$, both initialized to zero at the start. They are updated in the following way
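
As a reference point, here's a minimal NumPy sketch of the standard Adam update built on those two estimates; the function name, default hyperparameters, and bias-correction details are my own assumptions:

```python
import numpy as np

def adam_step(W, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using the moment estimates m_t and v_t."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: decaying mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: decaying mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias corrections, since m and v start at zero
    v_hat = v / (1 - beta2 ** t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```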

Read more »

RMSProp

Posted on 2018-06-26 | Post modified: 2018-06-26 | In AI, Deep Learning Fundamentals | Reading time ≈ 0:01

Exponentially Decaying Average

The idea of RMSProp is similar to what you saw in AdaDelta's first step: it maintains an exponentially decaying average of the squared gradients.
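
A minimal NumPy sketch of such a step, where the decaying average of squared gradients scales each update; the names and hyperparameter values are my own assumptions:

```python
import numpy as np

def rmsprop_step(W, avg_sq, grad, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update: keep a decaying average of squared gradients and scale by its root."""
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    W = W - lr * grad / (np.sqrt(avg_sq) + eps)
    return W, avg_sq
```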

Read more »

AdaDelta

Posted on 2018-06-24 | Post modified: 2018-06-25 | In AI, Deep Learning Fundamentals | Reading time ≈ 0:02

Average over a window

AdaDelta corrects the continually shrinking updates caused by AdaGrad's ever-growing accumulation of squared gradients. It does this with a simple idea.
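
For context, here's a minimal NumPy sketch of the standard AdaDelta step, which keeps decaying (windowed) averages of both squared gradients and squared updates; the names and values are my own assumptions, and the post's notation may differ:

```python
import numpy as np

def adadelta_step(W, avg_sq_grad, avg_sq_dx, grad, rho=0.95, eps=1e-6):
    """One AdaDelta update: decaying averages of squared gradients and squared updates."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    dx = -np.sqrt(avg_sq_dx + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_dx = rho * avg_sq_dx + (1 - rho) * dx ** 2
    W = W + dx
    return W, avg_sq_grad, avg_sq_dx
```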

Read more »

AdaGrad

Posted on 2018-06-22 | Post modified: 2018-06-22 | In AI, Deep Learning Fundamentals | Reading time ≈ 0:01

Recognizing Important Features

The idea of AdaGrad is a simple one: normalize each gradient update by the square root of the sum of squared gradients seen so far. Suppose we run the training for $t$ steps; then the weight update for step $t+1$ is as follows
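
A minimal NumPy sketch of this idea; the function name, learning rate, and the small epsilon added for numerical stability are my own assumptions:

```python
import numpy as np

def adagrad_step(W, sum_sq, grad, lr=0.01, eps=1e-8):
    """One AdaGrad update: scale by the root of the accumulated squared gradients."""
    sum_sq = sum_sq + grad ** 2
    W = W - lr * grad / (np.sqrt(sum_sq) + eps)
    return W, sum_sq
```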

Read more »

Nesterov Accelerated Gradient

Posted on 2018-06-20 | Post modified: 2018-07-03 | In AI, Deep Learning Fundamentals | Reading time ≈ 0:01

Correction Factor

While using momentum, we updated $V$ and then applied it to the weight update. The idea of Nesterov Accelerated Gradient (NAG) is to first update the current weight by a small part of $V$ and then calculate the gradient at that updated weight. The small part by which we update the current weight is called the correction factor, which is $\alpha V$. Simply put, we changed the position where the gradient is calculated. So for each iteration of gradient descent, the updates would be as follows
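
A minimal NumPy sketch of one NAG iteration following the look-ahead idea above, with $\alpha$ as the momentum coefficient and eps as the learning rate; the names, toy loss, and values are my own assumptions:

```python
import numpy as np

def nag_step(W, V, grad_fn, alpha=0.9, eps=0.01):
    """One NAG update: shift W by the correction factor alpha * V, then take the gradient there."""
    g = grad_fn(W - alpha * V)   # gradient at the look-ahead position
    V = alpha * V + eps * g
    W = W - V
    return W, V

# Toy usage on E(W) = 0.5 * ||W||^2, whose gradient is simply W.
W, V = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    W, V = nag_step(W, V, grad_fn=lambda w: w)
```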

Read more »

Momentum

Posted on 2018-06-15 | Post modified: 2018-06-18 | In AI, Deep Learning Fundamentals | Reading time ≈ 0:03

Gradient descent promises to take us a step closer to a minimum (local or global) on each iteration. But due to the high dimensionality, it's dealing with a very complicated error surface, which leads to very slow convergence towards the minimum. Let's look at a simple yet effective technique to overcome this problem.

Read more »

© 2018 Yeshwanth Arcot
Unless otherwise noted, all posts in All Matters AI by Yeshwanth Arcot are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.