The gradient descent algorithm we saw in an earlier post only gives you a rough idea of how the method works. We've seen the update rule
$$W \leftarrow W - \alpha E'(W)$$
This is just a rough formula to get you started. In practice, $E'(W)$ is the average of the gradients contributed by every point in your training data. This variant is known as vanilla gradient descent or batch gradient descent. Datasets with millions of instances are commonplace nowadays, and that presents a challenge for batch gradient descent: the weights are updated only after the gradient has been averaged over every data point, so each update is expensive and training time grows significantly.
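To make this concrete, here is a minimal sketch of batch gradient descent for a linear model with a mean-squared-error loss. The dataset, loss function, and hyperparameters are illustrative assumptions, not taken from the earlier post; the point is that every update touches the full training set.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, epochs=100):
    """Vanilla (batch) gradient descent for linear regression,
    minimizing E(W) = (1/2n) * sum((X @ W - y)^2)."""
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(epochs):
        # E'(W): gradient averaged over *every* training point
        predictions = X @ W
        grad = X.T @ (predictions - y) / n
        # One weight update per full pass over the dataset
        W = W - alpha * grad
    return W

# Toy usage: recover y = 2*x from a tiny synthetic dataset
X = np.arange(1, 6, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()
print(batch_gradient_descent(X, y, alpha=0.05, epochs=500))  # ~[2.0]
```

Note how the inner loop computes the gradient over all `n` rows before a single update is applied; with millions of rows, that one update becomes very costly.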
Let's look at some optimizations that keep this training cost under control.