This first installment of the "Statistical Learning Methods" note series corresponds to Chapter 2 of the original book. It quotes liberally from the original explanations and mixes in my own understanding. The algorithms in the book are implemented in Python, and the training process is animated with Matplotlib. It is dense, solid material, and it will pay off if you stick with it.

## Concept

The perceptron is a binary classification model: its input is the feature vector of an instance, and its output is the instance's class, +1 or -1.

### Definition (perceptron model)

Suppose the input space is $\mathcal{X} \subseteq \mathbf{R}^n$ and the output space is $\mathcal{Y} = \{+1, -1\}$, with $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Then the following function from the input space to the output space

$$f(x) = \operatorname{sign}(w \cdot x + b)$$

is called a perceptron. Here $w$ and $b$ are the perceptron model parameters: $w \in \mathbf{R}^n$ is called the weight or weight vector, $b \in \mathbf{R}$ is called the bias, and $w \cdot x$ denotes the inner product of $w$ and $x$. $\operatorname{sign}$ is the function

$$\operatorname{sign}(v) = \begin{cases} +1, & v \geq 0 \\ -1, & v < 0 \end{cases}$$

The geometric interpretation of the perceptron is the linear equation

$$w \cdot x + b = 0$$

which divides the feature space into two parts, positive and negative. This plane (which degenerates to a straight line in 2D) is called the separating hyperplane.
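In code, the model is just an inner product followed by a sign. A minimal sketch (the parameter values $w = (1, 1)$, $b = -3$ below are placeholders for illustration, not learned values):

```python
import numpy as np

def sign(v):
    # the book defines sign(v) = +1 for v >= 0 and -1 for v < 0
    return 1 if v >= 0 else -1

def perceptron_predict(x, w, b):
    """f(x) = sign(w . x + b)"""
    return sign(np.dot(w, x) + b)

# placeholder parameters, purely for illustration
w = np.array([1.0, 1.0])
b = -3.0
print(perceptron_predict(np.array([3, 3]), w, b))  # w.x + b = 3 >= 0, prints 1
print(perceptron_predict(np.array([1, 1]), w, b))  # w.x + b = -1 < 0, prints -1
```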

## Perceptron learning strategy

### Definition (linear separability of data sets)

Given a data set

$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$

where $x_i \in \mathcal{X} = \mathbf{R}^n$ and $y_i \in \mathcal{Y} = \{+1, -1\}$ for $i = 1, 2, \ldots, N$: if there exists a hyperplane $S$

$$w \cdot x + b = 0$$

that divides all positive and negative instance points completely and correctly (that is, $y_i (w \cdot x_i + b) > 0$ for every $i$), then $T$ is said to be linearly separable; otherwise $T$ is linearly inseparable.
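The definition translates directly into a check: a candidate $(w, b)$ separates $T$ exactly when $y_i (w \cdot x_i + b) > 0$ for every point. A minimal sketch on the example data used later in this article (both candidate hyperplanes are my own assumptions):

```python
import numpy as np

# positive points (3, 3), (4, 3); negative point (1, 1)
T = [((3, 3), 1), ((4, 3), 1), ((1, 1), -1)]

def separates(w, b, data):
    """True iff the hyperplane w.x + b = 0 classifies every point correctly."""
    return all(y * (np.dot(w, x) + b) > 0 for x, y in data)

print(separates(np.array([1.0, 1.0]), -3.0, T))  # True: all margins positive
print(separates(np.array([1.0, 0.0]), 0.0, T))   # False: (1, 1) falls on the wrong side
```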

### Perceptron learning strategy

Assuming the data set is linearly separable, we want to find a reasonable loss function.

A naive idea is to use the total number of misclassified points, but such a loss function is not continuously differentiable in the parameters $w, b$, so its change is hard to track and it is not easy to optimize (we would not know when to terminate training, or the termination time would not be optimal).

Another idea is to use the total distance from all misclassified points to the hyperplane $S$. To do this, first define the distance from a point $x_0$ to the plane $S$:

$$\frac{1}{\|w\|} |w \cdot x_0 + b|$$

The denominator $\|w\|$ is the $L_2$ norm of $w$, that is, the square root of the sum of the squares of the vector's elements (its length). This formula is easy to understand if you recall the point-to-plane distance learned in middle school:

$$d = \frac{|Ax_0 + By_0 + Cz_0 + D|}{\sqrt{A^2 + B^2 + C^2}}$$

The distance to the hyperplane $S$ here is the generalization of that distance to a multidimensional space.

Furthermore, if a point $x_i$ is misclassified, then

$$-y_i (w \cdot x_i + b) > 0$$

holds, so we can drop the absolute value and obtain the distance from a misclassified point to the hyperplane $S$:

$$-\frac{1}{\|w\|} y_i (w \cdot x_i + b)$$

Assuming that all misclassified points form a set $M$, the total distance from all misclassified points to the hyperplane $S$ is

$$-\frac{1}{\|w\|} \sum_{x_i \in M} y_i (w \cdot x_i + b)$$

The factor $\frac{1}{\|w\|}$ has little effect (it is always positive anyway), so dropping it gives the loss function of perceptron learning:

$$L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$$
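Written out, the loss is $L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$, a sum of functional margins over misclassified points only, so it is zero exactly when nothing is misclassified. A minimal sketch on the example data (both trial parameter settings are my own assumptions):

```python
import numpy as np

T = [((3, 3), 1), ((4, 3), 1), ((1, 1), -1)]

def perceptron_loss(w, b, data):
    """L(w, b) = -sum of y_i * (w . x_i + b) over misclassified points."""
    loss = 0.0
    for x, y in data:
        margin = y * (np.dot(w, x) + b)
        if margin <= 0:  # misclassified, or exactly on the hyperplane
            loss -= margin
    return loss

print(perceptron_loss(np.array([1.0, 0.0]), 0.0, T))   # only (1, 1) is wrong: loss 1.0
print(perceptron_loss(np.array([1.0, 1.0]), -3.0, T))  # separates everything: loss 0.0
```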

## Perceptron learning algorithm

### Original form

The perceptron learning algorithm solves the following optimization problem:

$$\min_{w, b} L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$$

The perceptron learning algorithm is misclassification-driven. First a hyperplane is chosen arbitrarily, then the loss function is continuously minimized by stochastic gradient descent. The gradient of the loss function is given by

$$\nabla_w L(w, b) = -\sum_{x_i \in M} y_i x_i$$

$$\nabla_b L(w, b) = -\sum_{x_i \in M} y_i$$

A gradient is a vector pointing in the direction in which a scalar field grows fastest, with length equal to that maximum rate of change. A scalar field simply assigns a scalar to every point in space (I think of the scalar as the output of a function).

Randomly select a misclassified point $(x_i, y_i)$ and update the parameters $w, b$:

$$w \leftarrow w + \eta y_i x_i$$

$$b \leftarrow b + \eta y_i$$

Here $\eta$ ($0 < \eta \leq 1$) is the learning rate. Moving the parameters opposite to the direction of gradient ascent makes the loss descend, so iterating this update keeps reducing the loss until it reaches zero. This yields the original form of the perceptron learning algorithm:
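A single update step can be sketched as follows, with $\eta = 1$, the usual zero initialization, and the first training point of the example data used later in this article:

```python
import numpy as np

eta = 1.0                  # learning rate
w, b = np.zeros(2), 0.0    # initial hyperplane

# with w = 0 and b = 0 every point has y_i * (w . x_i + b) = 0 <= 0, so the
# first point x1 = (3, 3), y1 = +1 counts as misclassified and drives an update
x1, y1 = np.array([3.0, 3.0]), 1.0
w = w + eta * y1 * x1   # w <- w + eta * y1 * x1
b = b + eta * y1        # b <- b + eta * y1
print(w, b)             # [3. 3.] 1.0
```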

For this algorithm, use the example from the book as test data: positive instances $x_1 = (3, 3)^T$, $x_2 = (4, 3)^T$ and negative instance $x_3 = (1, 1)^T$.

The Python implementation and visualization code are as follows:

### Perceptron algorithm code

Finally, the most exciting moment: with the knowledge above, you can visualize this simple algorithm perfectly.

```python
# -*- coding:utf-8 -*-
# Filename: train2.1.py
# Author: hankcs
# Date: 2015/1/30 16:29
import copy
from matplotlib import pyplot as plt
from matplotlib import animation

training_set = [[(3, 3), 1], [(4, 3), 1], [(1, 1), -1]]
w = [0, 0]
b = 0
history = []


def update(item):
    """
    update parameters using stochastic gradient descent
    :param item: an item which is classified into wrong class
    :return: nothing
    """
    global w, b, history
    w[0] += 1 * item[1] * item[0][0]
    w[1] += 1 * item[1] * item[0][1]
    b += 1 * item[1]
    print(w, b)  # watch the process of stochastic gradient descent
    history.append([copy.copy(w), b])


def cal(item):
    """
    calculate the functional distance between 'item' and the decision surface: yi(w*xi+b)
    :param item: a training example
    :return: yi(w*xi+b)
    """
    res = 0
    for i in range(len(item[0])):
        res += item[0][i] * w[i]
    res += b
    res *= item[1]
    return res


def check():
    """
    check if the hyperplane can classify the examples correctly
    :return: true if there was at least one misclassified point
    """
    flag = False
    for item in training_set:
        if cal(item) <= 0:
            flag = True
            update(item)  # draw a graph to show the process
    if not flag:
        print("RESULT: w: " + str(w) + " b: " + str(b))
    return flag


if __name__ == "__main__":
    for i in range(1000):
        if not check():
            break

    # first set up the figure, the axis, and the plot element we want to animate
    fig = plt.figure()
    ax = plt.axes(xlim=(0, 2), ylim=(-2, 2))
    line, = ax.plot([], [], 'g', lw=2)
    label = ax.text(0, 0, '')

    # initialization function: plot the background of each frame
    def init():
        line.set_data([], [])
        x, y, x_, y_ = [], [], [], []
        for p in training_set:
            if p[1] > 0:
                x.append(p[0][0])
                y.append(p[0][1])
            else:
                x_.append(p[0][0])
                y_.append(p[0][1])
        plt.plot(x, y, 'bo', x_, y_, 'rx')
        plt.axis([-6, 6, -6, 6])
        plt.grid(True)
        plt.xlabel('x')
        plt.ylabel('y')
        plt.title('Perceptron Algorithm (www.hankcs.com)')
        return line, label

    # animation function: this is called sequentially
    def animate(i):
        global history, ax, line, label
        w = history[i][0]
        b = history[i][1]
        if w[1] == 0:
            return line, label
        x1 = -7
        y1 = -(b + w[0] * x1) / w[1]
        x2 = 7
        y2 = -(b + w[0] * x2) / w[1]
        line.set_data([x1, x2], [y1, y2])
        x1 = 0
        y1 = -(b + w[0] * x1) / w[1]
        label.set_text(str(history[i]))
        label.set_position([x1, y1])
        return line, label

    # call the animator; blit=True means only re-draw the parts that have changed
    print(history)
    anim = animation.FuncAnimation(fig, animate, init_func=init,
                                   frames=len(history), interval=1000,
                                   repeat=True, blit=True)
    plt.show()
    anim.save('perceptron.gif', fps=2, writer='imagemagick')
```

### Visualization

It can be seen that the hyperplane is attracted by each misclassified point and moves toward it, gradually reducing the distance between them until the point is correctly classified. With this animation, doesn't the gradient descent step of the perceptron become more intuitive?

### Algorithm convergence

Augment the input vector with a constant 1, writing $\hat{x} = (x^T, 1)^T$; its maximum length is

$$R = \max_{1 \leq i \leq N} \|\hat{x}_i\|$$

Similarly augment the perceptron's parameter vector as $\hat{w} = (w^T, b)^T$. If the training set is linearly separable, there exists a hyperplane $\hat{w}_{\mathrm{opt}} \cdot \hat{x} = 0$ satisfying the condition $\|\hat{w}_{\mathrm{opt}}\| = 1$ that classifies the data set completely correctly, and we can define the minimum margin

$$\gamma = \min_{i} \{ y_i (w_{\mathrm{opt}} \cdot x_i + b_{\mathrm{opt}}) \}$$

Then the number of misclassifications $k$ satisfies

$$k \leq \left( \frac{R}{\gamma} \right)^2$$
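The bound $k \leq (R/\gamma)^2$ can be evaluated for the example data. Note that the hyperplane $w = (1, 1)$, $b = -3$ used below is simply one hyperplane that separates this data, not necessarily the optimal one, so the resulting number is the bound associated with that particular choice:

```python
import numpy as np

X = np.array([[3, 3], [4, 3], [1, 1]], dtype=float)
y = np.array([1, 1, -1], dtype=float)

# augmented inputs x_hat = (x, 1)
X_hat = np.hstack([X, np.ones((len(X), 1))])
R = max(np.linalg.norm(v) for v in X_hat)  # R = max ||x_hat_i||

# a separating hyperplane for this data (an assumption, not necessarily optimal)
w_hat = np.array([1.0, 1.0, -3.0])
w_hat /= np.linalg.norm(w_hat)  # the theorem assumes ||w_hat|| = 1

gamma = min(y[i] * np.dot(w_hat, X_hat[i]) for i in range(len(X)))
bound = (R / gamma) ** 2
print(round(bound))  # (R / gamma)^2 = 26 * 11 = 286
```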

For the proof, please refer to page 31 of Statistical Learning Methods.

### Dual form of perceptron learning algorithm

Duality here means expressing $w$ and $b$ as linear combinations of the training instances $(x_i, y_i)$ and solving for the coefficients of that combination, from which $w$ and $b$ are recovered. Concretely, suppose the parameters are updated $n_i$ times on the misclassified point $(x_i, y_i)$; then the increments of $w$ and $b$ contributed by that point are $\alpha_i y_i x_i$ and $\alpha_i y_i$ respectively, where $\alpha_i = n_i \eta$. The final learned parameters can therefore be expressed as

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad b = \sum_{i=1}^{N} \alpha_i y_i$$
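That is, $w = \sum_i \alpha_i y_i x_i$ and $b = \sum_i \alpha_i y_i$. Running the algorithm on the example data with $\eta = 1$ and cyclic scan order updates the three points $n = (2, 0, 5)$ times, and recovering the primal parameters from those dual coefficients gives the same hyperplane as the original form:

```python
import numpy as np

x = np.array([[3, 3], [4, 3], [1, 1]], dtype=float)
y = np.array([1, 1, -1], dtype=float)

# final dual coefficients alpha_i = n_i * eta for this run (eta = 1)
alpha = np.array([2.0, 0.0, 5.0])

w = np.dot(alpha * y, x)  # w = sum_i alpha_i * y_i * x_i
b = np.sum(alpha * y)     # b = sum_i alpha_i * y_i
print(w, b)               # [1. 1.] -3.0
```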

So there is Algorithm 2.2:

### Perceptron dual algorithm code

This form involves more matrix calculations, so NumPy is used more heavily:

```python
# -*- coding:utf-8 -*-
# Filename: train2.2.py
# Author: hankcs
# Date: 2015/1/31 15:15
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import animation

# An example in that book: the training set and parameters' sizes are fixed
training_set = np.array([[[3, 3], 1], [[4, 3], 1], [[1, 1], -1]], dtype=object)

a = np.zeros(len(training_set), dtype=float)
b = 0.0
Gram = None
y = np.array(training_set[:, 1], dtype=float)
x = np.empty((len(training_set), 2), dtype=float)
for i in range(len(training_set)):
    x[i] = training_set[i][0]
history = []


def cal_gram():
    """
    calculate the Gram matrix
    :return: the Gram matrix of the training inputs
    """
    g = np.empty((len(training_set), len(training_set)), dtype=int)
    for i in range(len(training_set)):
        for j in range(len(training_set)):
            g[i][j] = np.dot(training_set[i][0], training_set[j][0])
    return g


def update(i):
    """
    update parameters using stochastic gradient descent
    :param i: index of a misclassified example
    """
    global a, b
    a[i] += 1
    b = b + y[i]
    history.append([np.dot(a * y, x), b])
    # print(a, b)  # uncomment to check the process of stochastic gradient descent


def cal(i):
    """calculate the judge condition yi(sum_j aj*yj*Gram[j][i] + b)"""
    global a, b, x, y
    res = np.dot(a * y, Gram[i])
    res = (res + b) * y[i]
    return res


def check():
    """check if the hyperplane can classify the examples correctly"""
    global a, b, x, y
    flag = False
    for i in range(len(training_set)):
        if cal(i) <= 0:
            flag = True
            update(i)
    if not flag:
        w = np.dot(a * y, x)
        print("RESULT: w: " + str(w) + " b: " + str(b))
        return False
    return True


if __name__ == "__main__":
    Gram = cal_gram()  # initialize the Gram matrix
    for i in range(1000):
        if not check():
            break

    # draw an animation to show how it works; the data comes from history
    # first set up the figure, the axis, and the plot element we want to animate
    fig = plt.figure()
    ax = plt.axes(xlim=(0, 2), ylim=(-2, 2))
    line, = ax.plot([], [], 'g', lw=2)
    label = ax.text(0, 0, '')

    # initialization function: plot the background of each frame
    def init():
        line.set_data([], [])
        x, y, x_, y_ = [], [], [], []
        for p in training_set:
            if p[1] > 0:
                x.append(p[0][0])
                y.append(p[0][1])
            else:
                x_.append(p[0][0])
                y_.append(p[0][1])
        plt.plot(x, y, 'bo', x_, y_, 'rx')
        plt.axis([-6, 6, -6, 6])
        plt.grid(True)
        plt.xlabel('x')
        plt.ylabel('y')
        plt.title('Perceptron Algorithm 2 (www.hankcs.com)')
        return line, label

    # animation function: this is called sequentially
    def animate(i):
        global history, ax, line, label
        w = history[i][0]
        b = history[i][1]
        if w[1] == 0:
            return line, label
        x1 = -7.0
        y1 = -(b + w[0] * x1) / w[1]
        x2 = 7.0
        y2 = -(b + w[0] * x2) / w[1]
        line.set_data([x1, x2], [y1, y2])
        x1 = 0.0
        y1 = -(b + w[0] * x1) / w[1]
        label.set_text(str(history[i][0]) + ' ' + str(b))
        label.set_position([x1, y1])
        return line, label

    # call the animator; blit=True means only re-draw the parts that have changed
    anim = animation.FuncAnimation(fig, animate, init_func=init,
                                   frames=len(history), interval=1000,
                                   repeat=True, blit=True)
    plt.show()
    # anim.save('perceptron2.gif', fps=2, writer='imagemagick')
```

### Visualization

The result is the same as Algorithm 1's. We can also change the data set:

```python
training_set = np.array([[[3, 3], 1], [[4, 3], 1], [[1, 1], -1], [[5, 2], -1]])
```

and get a more complicated result:

## After reading

This chapter teaches the common concepts and workflow of ML through the simplest of models.

Also, this article is only a personal note, kept as a memorandum; it makes no guarantee of quality or of follow-up posts. A blog is at best a supplement: to really get started, read the classics.

## Reference

Some of the code in this article is based on OldPanda's implementation.