Neuron and the Perceptron Algorithm

This is the first part of the series “Deep Learning Fundamentals”. The goal of this series is to explore the mechanisms of artificial neural networks, with a focus on presenting an intuitive way of understanding them. So, you can expect an emphasis on how and why things work rather than on what merely gets the job done. More often than not I’ll use simple math without getting caught up in notation. Let’s jump into the fundamental unit of most neural networks: the neuron.

Neuron

A neuron is nothing but a binary classifier, which means that, given an n-dimensional space, it can divide that space into two regions. Let’s try to come up with an algorithm that does this, assuming the data is linearly separable. For the sake of brevity, let’s also assume that the data we deal with is numerical across all dimensions. (Dealing with other kinds of data is for another post.)

Without loss of generality, let’s start with two-dimensional data. Say there are some green (0) and blue (1) points on a plane and we want a line that separates them. An intuitive way would be to draw a line as shown in the graph below. So far so good. But what we ultimately need is an algorithm that does all of this by itself and adjusts to any new points.

The Search for a Classifier

Let’s start with a random line in this space and see what we must do to create the two regions. Let’s try it ourselves first. One way would be to rotate and translate the line until it separates the two groups. This seems intuitive enough. Play around with the interactive graph below to separate the two groups of points, making sure each point ends up in its respective colored region.

You would’ve noticed that $a$ and $b$ contribute to the rotation and $c$ contributes to translation. This makes sense because $a$ and $b$ control the slope and $c$ controls the distance from the origin. But how did you know when to stop fiddling around with those parameters? A trivial but useful answer would be to stop when there are no misclassified points.
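To see why, note that for a line written as $ax_1 + bx_2 + c = 0$ (the form we’ll adopt in the next section), a little algebra gives

$$\text{slope} = -\frac{a}{b} \quad (b \neq 0), \qquad \text{distance from origin} = \frac{|c|}{\sqrt{a^2 + b^2}}$$

so changing $a$ or $b$ rotates the line, while changing $c$ (with $a$ and $b$ fixed) translates it.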

At this point, we are ready to develop a general algorithm for classifying two groups of points. It would look something like this:

    initialize random line parameters
    while misclassified points are present:
        fiddle around with the line parameters

The Perceptron Trick

Let’s start making the algorithm a little more concrete. Let the line be $ax_1 + bx_2 + c = 0$, where we initialize $a$, $b$, $c$ at random. For classifying the points, let’s first give the points labels (0 and 1) to differentiate them. Let $P(x_1, x_2)$ be a point in our data space with label $L_P$, and treat $ax_1 + bx_2 + c > 0$ as a prediction of $1$ (and $ax_1 + bx_2 + c \leq 0$ as a prediction of $0$). If this prediction matches $L_P$, then $P$ is classified correctly.
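In code, that check might look something like this (a minimal sketch; the function and variable names are just placeholders):

```python
# Check whether a point is on the correct side of the line, assuming labels 0 and 1.
# a, b, c are the current line parameters; (x1, x2) is the point P and label is L_P.
def is_correctly_classified(a, b, c, x1, x2, label):
    prediction = 1 if a * x1 + b * x2 + c > 0 else 0
    return prediction == label
```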

Now let’s deal with the fiddling part of the algorithm. Before we see any math on this, let’s try to build our intuition for it. Below are two cases:

  • case 1: $0$ label point is misclassified.
  • case 2: $1$ label point is misclassified.

See if you can come up with a general rule for modifying $a$, $b$ and $c$ that leads to the correct classification of these points.

With a bit of effort, you can observe that the following pattern works in any situation:

  • increase parameters for a point labeled $1$ and classified as $0$.
  • decrease parameters for a point labeled $0$ and classified as $1$.

Don’t worry if you haven’t found this out. Try it now and see if it works. The reason you might have missed the pattern is that you were trying to

  • increase one parameter and decrease another.
  • increase or decrease the parameters by a random amount until it does the job.

There’s nothing wrong with the above two operations; they do get the job done. The problem is that we can’t generalize them to any given scenario. By this point, you should have a decent amount of intuition for how all this works, so here’s how you can update the parameters: the amount by which each one has to change is directly proportional to $x_1$ for $a$ and to $x_2$ for $b$. This translates to the following update rule, which works for both kinds of misclassification.

For every misclassified point $P(x_1, x_2)$:
$\hspace{1cm}a = a + \alpha \, (\text{expected} - \text{predicted}) \, x_1$
$\hspace{1cm}b = b + \alpha \, (\text{expected} - \text{predicted}) \, x_2$
$\hspace{1cm}c = c + \alpha \, (\text{expected} - \text{predicted})$
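As a small sketch in code (the function name is a placeholder; `alpha`, `expected`, and `predicted` mirror the symbols above):

```python
def perceptron_update(a, b, c, x1, x2, expected, predicted, alpha=0.1):
    # (expected - predicted) is +1 when a point labeled 1 is classified as 0,
    # -1 when a point labeled 0 is classified as 1, and 0 when the point is correct.
    error = expected - predicted
    a = a + alpha * error * x1
    b = b + alpha * error * x2
    c = c + alpha * error
    return a, b, c
```

Note how this covers both bullet points above: the sign of `(expected - predicted)` decides whether the parameters increase or decrease.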

Here $\alpha$ is the proportionality constant, also known as the “learning rate”, and it acts like a fine-tuning knob: it controls the rate at which we change the parameters. We want our perceptron to change the parameters slowly, because large changes often end up misclassifying points that were correctly classified before. There are no universally good values for the learning rate. See for yourself how the learning rate affects speed and performance here.

In a deep learning setting, the parameters $a$ and $b$ are usually denoted by $w_1$ and $w_2$ and are part of a weight vector $w$, while $c$ is known as the bias. The inputs and outputs of a neuron are as shown in the figure below.

$$z = w_0 + w_1x_1 + w_2x_2 + \dots + w_mx_m$$

$$
H(z) =
\begin{cases}
0 & z < 0 \\
1 & z \geq 0
\end{cases}
$$
Perceptron
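Putting the weighted sum $z$ and the step function $H$ together, a single neuron’s forward pass can be sketched as follows (an illustrative snippet, not any particular library’s API):

```python
import numpy as np

def neuron(w, b, x):
    # w: weights (w_1 ... w_m), b: bias (w_0), x: inputs (x_1 ... x_m)
    z = b + np.dot(w, x)       # z = w_0 + w_1*x_1 + ... + w_m*x_m
    return 1 if z >= 0 else 0  # step activation H(z)
```

For example, `neuron(np.array([1.0, -2.0]), 0.5, np.array([3.0, 1.0]))` computes $z = 0.5 + 3 - 2 = 1.5$ and returns $1$.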

The Perceptron Algorithm

To wrap up this section, here’s the formal definition in a deep learning setting.

Data: Training data: $(x_i, y_i)$ $\forall i \in \{0, 1, 2, \dots, N\}$, Learning rate: $\eta$, where

  • $x_i$ is an m-dimensional input vector and $N$ is the total number of instances in our data.
  • $x_{i, 0} = 1$ $\forall i \in \{0, 1, 2, \dots, N\}$
  • ${\hat y}_i$ is the prediction for the point $(x_i, y_i)$

Result: Separating hyperplane coefficients: $w^*$
Initialize $w \leftarrow$ random weights; (since $x_{i, 0} = 1$, $w_0$ acts as the bias without setting it explicitly.)
repeat

$\hspace{1cm}$get example $(x_i, y_i)$;
$\hspace{1cm}\hat{y}_i \leftarrow H(w^T x_i)$;
$\hspace{1cm}w \leftarrow w + \eta \, (y_i - \hat{y}_i) \, x_i$;

until convergence;
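As a reference, here is one way the whole procedure could be written in NumPy, following the definition above; the epoch cap and the convergence check (stop after an epoch with no misclassifications) are assumptions of this sketch:

```python
import numpy as np

def perceptron(X, y, eta=0.1, max_epochs=1000):
    # X: (N, m) array of inputs, y: (N,) array of labels in {0, 1}
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend x_{i,0} = 1 so w_0 acts as the bias
    w = np.random.randn(X.shape[1])            # initialize w with random weights
    for _ in range(max_epochs):
        misclassified = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if w @ x_i >= 0 else 0   # y_hat = H(w^T x_i)
            w += eta * (y_i - y_hat) * x_i     # w <- w + eta (y_i - y_hat) x_i
            misclassified += int(y_hat != y_i)
        if misclassified == 0:                 # converged: no misclassified points left
            break
    return w
```

On linearly separable data, the perceptron convergence theorem guarantees that this loop eventually stops making mistakes, which is exactly the “until convergence” condition above.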

Here’s your dopamine shot for making it till the end.

XKCD comic