Loss function and Cross Entropy

In the gradient descent post, you saw what an error function is and what the characteristics of a good loss function are. Let’s take a deep dive into how we achieve those characteristics.

The desired characteristics of a good loss function are

  • Continuous and differentiable.
  • Penalizes wrong classifications heavily and correct classifications weakly.

Prediction as a probability

So in the perceptron post we used the Heaviside step function to decide whether the output was 0 or 1. But our ideal loss function demands a continuous and differentiable function. So we hack our network’s output in such a way that it gives the probability that the given input belongs to a class. For example, if there are two classes $A$ and $B$, after the hack we expect the network to output $P(A)$. The probability of the input being $B$ can then be easily calculated as $P(B) = 1 - P(A)$.

All we have to do to achieve this is change the activation function of the final output. In the activation post we saw the three most common activation functions - $ReLU$, $sigmoid$ and $tanh$. $ReLU$ is not differentiable at zero and its outputs are unbounded above, so it is avoided as the activation function of the final output. $tanh$ produces outputs in the range $(-1, 1)$, but we want the output to be in the range $[0, 1]$ to represent a probability. $sigmoid$ does the job here: it is a good choice, producing outputs in the range $(0, 1)$.

$$sigmoid(x) = \frac{1}{1+e^{-x}}$$
Sigmoid function graph
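As a quick numerical illustration, here is a minimal sketch of the sigmoid function in Python (numpy is my assumption here, not something used in the post):

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs map close to 0, large positive inputs close to 1.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```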

Finding the ideal function

We now have a neural network that produces a probability $p$. Our error function can take advantage of this fact. Now the only thing left is the penalty part. For a point labeled 1, we need our error function to be

  • high when its output is close to 0 (since the network assigns a low probability to the point being 1)
  • low when its output is close to 1 (and zero when we’re absolutely confident, i.e. the probability is 1).

You can explore functions that have these characteristics. One of the most commonly used functions for this purpose is $-\ln(p)$. See the response of $-\ln(p)$ below.

log-loss or cross entropy
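To make the penalty concrete, here is a small sketch (assuming numpy) that evaluates $-\ln(p)$ at a few probabilities:

```python
import numpy as np

# Penalty -ln(p) for a point whose true label is 1.
for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"p = {p:<4}  penalty = {-np.log(p):.3f}")

# p = 0.01 -> penalty = 4.605  (confidently wrong: huge penalty)
# p = 0.99 -> penalty = 0.010  (confidently right: tiny penalty)
```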

Cross Entropy

It’s time to design ourselves an error function that works for both the labels $0$ and $1$. If $-\ln(p)$ gives us the penalty for a point labeled $1$, then $-\ln(1-p)$ is the penalty for a point labeled $0$, since $1-p$ is the probability of the point being 0. Now all we have to do is apply the penalty for 1 when the true label is $1$ and the penalty for 0 when the label is $0$. We can do so easily with the following formula, where $y$ is the true label and $p$ is the predicted probability that the label is 1:

$$
-y\ln(p) - (1-y)\ln(1-p) =
\begin{cases}
-\ln(1-p) & \text{if } y=0 \\
-\ln(p) & \text{if } y=1
\end{cases}
$$

$-y\ln(p) - (1-y)\ln(1-p)$ is known as cross-entropy (also called log-loss).
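Here is a minimal numpy sketch of this binary cross-entropy, averaged over a small batch of points; the clipping constant is my own addition to avoid $\ln(0)$:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average cross-entropy: -y*ln(p) - (1-y)*ln(1-p)."""
    # Clip so that an (over)confident prediction never hits ln(0).
    p = np.clip(p_pred, eps, 1 - eps)
    return np.mean(-y_true * np.log(p) - (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1])        # true labels
p = np.array([0.9, 0.2, 0.6])  # predicted probabilities of label 1
print(binary_cross_entropy(y, p))  # ~0.28
```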

Multi-class Classification

Till now we’ve seen problems where there are only two outcomes, i.e. binary classification. Now let’s see a problem with more than two labels. Assume we are given four temperatures taken at 12PM, 6PM, 12AM and 6AM of a particular city. Our task is to classify the given temperatures into spring, summer, autumn or winter. So we need four outputs here, each corresponding to a season. In binary classification, our labels were straightforward: either $0$ or $1$. But how are labels for a multi-class problem set? We prefer not to set the labels to $0, 1, 2, 3$. We’ll discuss why this is the case in another post. But for now let’s see a popular way of setting labels for multi-class problems.

One-hot encoding

We create an array of size $n$, where $n$ is the number of classes. Then we assign each class a position in the array. For a label that needs to specify a class, the position corresponding to that class is filled with $1$ and the rest with $0$. Let’s see this in a concrete example. For the season problem above, one way to set the labels is as follows
$$
label_{\text{spring}} = \begin{pmatrix}
1 \\
0 \\
0 \\
0
\end{pmatrix}, label_{\text{summer}} = \begin{pmatrix}
0 \\
1 \\
0 \\
0
\end{pmatrix}, label_{\text{autumn}} = \begin{pmatrix}
0 \\
0 \\
1 \\
0
\end{pmatrix}, label_{\text{winter}} = \begin{pmatrix}
0 \\
0 \\
0 \\
1
\end{pmatrix}
$$

This is known as One-hot encoding.
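A minimal sketch of building such labels with numpy; the class ordering is just the one chosen above:

```python
import numpy as np

SEASONS = ["spring", "summer", "autumn", "winter"]

def one_hot(season, classes=SEASONS):
    """Return a vector with a 1 at the class position and 0 elsewhere."""
    label = np.zeros(len(classes))
    label[classes.index(season)] = 1
    return label

print(one_hot("autumn"))  # [0. 0. 1. 0.]
```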

Multi-class Cross Entropy

We also need to change cross entropy to adapt to the multi-class problem. Before we do that, we need to change the output of our neural network accordingly. In binary classification the output was the probability of the input belonging to a class. Here we need multiple outputs, one producing the probability of each class.

Softmax

We used the sigmoid function in the binary classifier to convert the output to a probability. We use a similar trick in the multi-class problem. The idea is to normalize each output by the sum of all the outputs. In order to do that we need all the outputs to be positive, since a probability cannot be negative. So we change each output $y_i$ to $e^{y_i}$. Now we can normalize and get the probability of each class. This procedure is known as softmax. Let’s see a concrete example from the season problem defined above

$$
\hat{Y} = \begin{pmatrix}
\hat{y_1} \\
\hat{y_2} \\
\hat{y_3} \\
\hat{y_4}
\end{pmatrix}
\begin{matrix}
\rightarrow \text{ this corresponds to spring} \\
\rightarrow \text{ this corresponds to summer} \\
\rightarrow \text{ this corresponds to autumn} \\
\rightarrow \text{ this corresponds to winter}
\end{matrix}
$$

$$
softmax(\hat{Y}) = \begin{pmatrix}
\dfrac{e^{\hat{y_1}}}{e^{\hat{y_1}}+e^{\hat{y_2}}+e^{\hat{y_3}}+e^{\hat{y_4}}} \\[2ex]
\dfrac{e^{\hat{y_2}}}{e^{\hat{y_1}}+e^{\hat{y_2}}+e^{\hat{y_3}}+e^{\hat{y_4}}} \\[2ex]
\dfrac{e^{\hat{y_3}}}{e^{\hat{y_1}}+e^{\hat{y_2}}+e^{\hat{y_3}}+e^{\hat{y_4}}} \\[2ex]
\dfrac{e^{\hat{y_4}}}{e^{\hat{y_1}}+e^{\hat{y_2}}+e^{\hat{y_3}}+e^{\hat{y_4}}}
\end{pmatrix}
\begin{matrix}
\rightarrow \text{ probability of output being spring} \\[2ex]
\rightarrow \text{ probability of output being summer} \\[2ex]
\rightarrow \text{ probability of output being autumn} \\[2ex]
\rightarrow \text{ probability of output being winter}
\end{matrix}
$$
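To tie softmax and the one-hot labels together, here is a minimal numpy sketch; the multi-class cross-entropy used at the end, $-\sum_i y_i \ln(\hat{p_i})$, is the standard one-hot generalization of the binary formula above:

```python
import numpy as np

def softmax(y_hat):
    """Exponentiate the raw outputs and normalize them into probabilities."""
    exps = np.exp(y_hat - np.max(y_hat))  # shift by the max for numerical stability
    return exps / np.sum(exps)

def cross_entropy(one_hot_label, probs, eps=1e-12):
    """Multi-class cross-entropy: -sum_i y_i * ln(p_i)."""
    return -np.sum(one_hot_label * np.log(np.clip(probs, eps, 1.0)))

raw_outputs = np.array([2.0, 1.0, 0.1, -1.0])  # network outputs for the four seasons
probs = softmax(raw_outputs)                   # ~[0.64, 0.23, 0.10, 0.03]
label_spring = np.array([1, 0, 0, 0])          # one-hot label for spring
print(probs, cross_entropy(label_spring, probs))
```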

With this, we have all the necessary elements for training and testing neural networks. From the next post onwards, we’ll look at common problems and their respective solutions. Don’t forget your comic.

XKCD Wisdom of the ancients