I'm writing a perceptron learning algorithm on simulated data. However, the program runs into an infinite loop and the weights grow without bound. What should I do to debug my program? If you can point out what's going wrong, that would also be appreciated.
What I'm doing here is first generating some data points at random and assigning labels to them according to a linear target function, then using perceptron learning to learn this linear function. (A scatter plot of the labelled data for 100 samples was attached here.)
Also, this is Exercise 1.4 in the book Learning from Data.
import numpy as np

a = 1
b = 1

def target(x):
    if x[1] > a*x[0] + b:
        return 1
    else:
        return -1

def gen_y(X_sim):
    return np.array([target(x) for x in X_sim])

def pcp(X, y):
    w = np.zeros(2)
    Z = np.hstack((X, np.array([y]).T))
    while ~all(z[2]*np.dot(w, z[:2]) > 0 for z in Z):  # some training sample is misclassified
        i = np.where(y*np.dot(w, x) < 0 for x in X)[0][0]  # update the weight based on misclassified sample
        print(i)
        w = w + y[i]*X[i]
    return w

if __name__ == '__main__':
    X = np.random.multivariate_normal([1, 1], np.diag([1, 1]), 20)
    y = gen_y(X)
    w = pcp(X, y)
    print(w)
The w I got goes off to infinity:
[-1.66580705 1.86672845]
[-3.3316141 3.73345691]
[-4.99742115 5.60018536]
[-6.6632282 7.46691382]
[-8.32903525 9.33364227]
[ -9.99484231 11.20037073]
[-11.66064936 13.06709918]
[-13.32645641 14.93382763]
[-14.99226346 16.80055609]
[-16.65807051 18.66728454]
[-18.32387756 20.534013 ]
[-19.98968461 22.40074145]
[-21.65549166 24.26746991]
[-23.32129871 26.13419836]
[-24.98710576 28.00092682]
[-26.65291282 29.86765527]
[-28.31871987 31.73438372]
[-29.98452692 33.60111218]
[-31.65033397 35.46784063]
[-33.31614102 37.33456909]
[-34.98194807 39.20129754]
[-36.64775512 41.068026 ]
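Before going further, one Python detail seems worth checking in isolation: the loop condition above uses ~ on the result of all(), and ~ is bitwise NOT on integers, not logical negation. A minimal check (plain Python, independent of my data):

# bool is an int subclass, so ~ does integer bitwise NOT:
print(~True, ~False)                # -2 -1
print(bool(~True), bool(~False))    # True True: both are truthy!
# So `while ~all(...)` can never exit, whatever the data.
# Logical negation is spelled `not all(...)`:
print(not all([True, True]))        # False
print(not all([True, False]))       # True

This alone would keep the first pcp looping even if every sample were classified correctly.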
(The textbook's statement and the exercise itself, Exercise 1.4, were attached as images here.)
Aside question: I actually don't get why this update rule works. Is there a good geometric intuition for how it works? The book clearly gives none. The update rule is simply w(t+1) = w(t) + y(t)x(t) whenever (x, y) is misclassified, i.e. y != sign(w^T*x).
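For reference, one line of algebra gives the usual intuition (the standard PLA argument, not quoted from the book). If (x, y) is misclassified and the update sets w' = w + y*x, then

y*w'^T*x = y*(w + y*x)^T*x = y*w^T*x + y^2*||x||^2 = y*w^T*x + ||x||^2

since y^2 = 1. Each update therefore increases the agreement y*w^T*x on the offending sample by ||x||^2: geometrically, w rotates toward x when y = +1 and away from x when y = -1, so that sample becomes less misclassified. For linearly separable data, the standard convergence theorem bounds how many such updates can occur.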
Following one of the answers, I revised the code:
import numpy as np

np.random.seed(0)

a = 1
b = 1

def target(x):
    if x[1] > a*x[0] + b:
        return 1
    else:
        return -1

def gen_y(X_sim):
    return np.array([target(x) for x in X_sim])

def pcp(X, y):
    w = np.ones(3)
    Z = np.hstack((np.array([np.ones(len(X))]).T, X, np.array([y]).T))
    while not all(z[3]*np.dot(w, z[:3]) > 0 for z in Z):  # some training sample is misclassified
        print([z[3]*np.dot(w, z[:3]) > 0 for z in Z])
        print(not all(z[3]*np.dot(w, z[:3]) > 0 for z in Z))
        i = np.where(z[3]*np.dot(w, z[:3]) < 0 for z in Z)[0][0]  # update the weight based on misclassified sample
        w = w + Z[i, 3]*Z[i, :3]
        print([z[3]*np.dot(w, z[:3]) > 0 for z in Z])
        print(not all(z[3]*np.dot(w, z[:3]) > 0 for z in Z))
        print(i, w)
    return w

if __name__ == '__main__':
    X = np.random.multivariate_normal([1, 1], np.diag([1, 1]), 20)
    y = gen_y(X)
    # import matplotlib.pyplot as plt
    # plt.scatter(X[:, 0], X[:, 1], c=y)
    # plt.scatter(X[1, 0], X[1, 1], c='red')
    # plt.show()
    w = pcp(X, y)
    print(w)
This is still not working, and it prints:
[False, True, False, False, False, True, False, False, False, False, True, False, False, False, False, False, False, False, False, False]
True
[True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True]
True
0 [ 0. -1.76405235 -0.40015721]
[True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True]
True
[True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True]
True
0 [-1. -4.52810469 -1.80031442]
[True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True]
True
[True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True]
True
0 [-2. -7.29215704 -3.20047163]
[True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True]
True
[True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True]
True
0 [ -3. -10.05620938 -4.60062883]
[True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True]
True
It seems that: 1. only the three +1 samples evaluate to False (this matched the scatter plot, omitted here); 2. the index returned by the np.where expression (used like MATLAB's find) is wrong: it is 0 on every iteration.
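For completeness, here is a minimal sketch of the fix as I understand it (same setup as above, but with the misclassification test evaluated eagerly). np.where applied to a generator expression wraps it in a 0-d object array, and a generator object is always truthy, so the lookup yields index 0 on every pass, which matches the printed i above. Evaluating a boolean array instead behaves like MATLAB's find:

import numpy as np

def pcp(X, y):
    # Augment with a constant-1 column so the intercept is learned as w[0].
    Z = np.hstack((np.ones((len(X), 1)), X, np.array([y]).T))
    w = np.ones(3)
    while True:
        # Boolean ARRAY of misclassified samples, not a generator expression.
        bad = np.where(Z[:, 3] * Z[:, :3].dot(w) <= 0)[0]
        if len(bad) == 0:        # every sample classified correctly
            return w
        i = bad[0]               # first misclassified sample
        w = w + Z[i, 3] * Z[i, :3]

Since the target is linear in (1, x1, x2), the augmented data are linearly separable, and the perceptron convergence theorem then guarantees this loop terminates.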