Cost Function and Gradient Descent

In the last post we’ve discussed how a Multilayer Perceptron works. Now let’s see how to search/update a set of weights to achieve a good classifer.

Error Function

We are on the search for the best set of weights for our classifier. Before we do that, we need something that tells us how good we are doing with the current set of weights. This is exactly what an error function AKA loss function AKA cost function does. It provides us a quantifiable metric depending on which we can update our set of weights. Let’s say our error function is $E(W)$ where W is the set of weights.

Gradient Descent

Our goal now is to minimize $E(W)$ for the classifier to work. To do that, we have $W$ that we can use to change $E(W)$. Here is where a basic idea from calculus helps. A small change in the input of a function is related to the derivative in the following way

$$f(\underbrace{x + \epsilon}_{\text{a small change in input}}) \approx f(x) + \underbrace{\epsilon f’(x)}_{\text{The corresponding change in output}} $$

So, an $\epsilon$ change in the input produces a $\epsilon f’(x)$ change in the output. This is all we need to decrease our error function. Some math recap before we build error minimizing algorithm.
The derivative of a function at a point is

positive if the function is increasing in the neighborhood of the point.
negative if it’s decreasing in the neighborhood of the point.
zero if it’s neither increasing or decreasing.(saddle points)

We can use this fact in our error function as shown below

$$E(W+\epsilon) \approx E(W) + \epsilon E’(W) \hspace{1cm}\text{for a small enough}\hspace{0.2cm} \epsilon$$
$$\implies E(W+\epsilon) < E(W)\hspace{1cm} if \hspace{0.2cm}E’(W) < 0 $$
$$\implies E(W-\epsilon) < E(W)\hspace{1cm} if \hspace{0.2cm}E’(W) > 0 $$

So, all we have to do now is decrease the weights when derivate is positive and vice versa. If we repeat this operation with a small enough $\epsilon$ we can minimize the error function. This is the whole idea of Gradient Descent. A simple form for the algorithm is

until we reach an error threshold or $E’(W) = 0$
$$W ← W - \epsilon E’(W)
$$

Choosing an error function

The simple yet powerful algorithm depends on the error function to work. So, let’s explore how to choose an error function. In the perceptron post we chose the number of misclassified points as our error function. But, # misclassified points can only have integer values and is a discrete function. This is a poor choice for an error function, because

In many situations you’ll find that the number of misclassified points can’t be decreased. The weights keep bouncing between a set of values.
Differentiating a discrete function.
All the misclassified points receive the same treatment. Points misclassified by a small margin aren’t any different from large error margins.

To overcome these challenges we can emphasize on the following characteristics

The error function must be continuous and differentiable.
Penalize misclassified points based on their error margins.

Designing such function is for another post. Let’s see some limitations of gradient descent before moving to the next post.

The limitations gradient descent

As long as $E’(W)$ is not zero, our weights keep updating and we’ll be approaching a minima in the error function. But what about the case when it’s zero, our algorithm terminates. But we don’t have to be at a minima when this happens. Points where $E’(W) = 0$ are called critical points. Three different scenarios for critical points are

Minima: Moving in any direction doesn’t decrease the function.
Maxima: Moving in any direction doesn’t increase the function.
Saddle Point: Moving in any direction doesn’t increase or decrease the function.

So our algorithm can stop at any of these situations. Even if we land on a minima it may not be the absolute lowest that is possible. When there are many local minima, we find ourselves stuck at the local minima. Different scenarios that can occur are shown in the figure below.
Approximate Minimization

Ideally, we would like to achieve this point (Global Minima), but it may not be possible.
This is at par with the global minima and is acceptable.
These points lead to poor performance.

We’ll discuss tricks and tips to mitigate these limitations in another post. Here’s a little treat for making it till the end.

Alexa joke