Why is this Linear Classifier algorithm wrong?

Question

I specify 'n' amount of points. Label them +1 or -1. I store all this in a dictionary that looks like: {'point1' : [(0.565,-0.676), +1], ... }. I am trying to find a line that separates them - i.e. points labeled +1 above the line, those -1 below the line. Can anyone help?

I'm trying to apply w = w + y(r) as the "learning algorithm", w is the weight vector y is +1 or -1, r is the point

The code runs but the separating line is not precise - it doesn't separate correctly. Also, as I increase the number of points to separate, the line gets less efficient.

If you run the code, the green line is supposed to be the separating line. The closer it get to the slope of the blue line (the perfect line by definition), the better.

from matplotlib import pyplot as plt
import numpy as np
import random 

n = 4
x_values = [round(random.uniform(-1,1),3) for _ in range(n)]
y_values = [round(random.uniform(-1,1),3) for _ in range(n)]
pts10 = zip(x_values, y_values)
label_dict = {}


x1, y1, x2, y2 = (round(random.uniform(-1,1),3) for _ in range(4))
b = [x1, y1] 
d = [x2, y2]
slope, intercept = np.polyfit(b, d, 1)

fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(*zip(*pts10), color = 'black')
ax.plot(b,d,'b-')

label_plus = '+'
label_minus = '--'
i = 1                              
for x,y in pts10:
    if(y > (slope*x + intercept)):
        ax.annotate(label_plus, xy=(x,y), xytext=(0, -10), textcoords='offset points', color = 'blue', ha='center', va='center')
        label_dict['point{}'.format(i)] = [(x,y), "+1"]  
    else:
        ax.annotate(label_minus, xy=(x,y), xytext=(0, -10), textcoords='offset points', color = 'red', ha='center', va='center')
        label_dict['point{}'.format(i)] = [(x,y), "-1"]
    i += 1



# this is the algorithm
def check(ww,rr):
    while(np.dot(ww,rr) >= 0):
        print "being refined 1"
        ww = np.subtract(ww,rr)
    return ww
def check_two(ww,rr):
    while(np.dot(ww,rr) < 0):
        print "being refined 2"
        ww = np.add(ww,rr)
    return ww

w = np.array([0,0])
ii = 1
for x,y in pts10:
    r = np.array([x,y])
    print w
    if (np.dot(w,r) >= 0) != int(label_dict['point{}'.format(ii)][1]) < 0:
        print "Point " + str(ii) + " should have been below the line"
        w = np.subtract(w,r)  
        w = check(w,r)
    elif (np.dot(w,r) < 0) != int(label_dict['point{}'.format(ii)][1]) >= 0:
        print "Point " + str(ii) + " should have been above the line"
        w = np.add(w,r)
        w = check_two(w,r)
    else:
        print "Point " + str(ii) + " is in the correct position"
    ii += 1

ax.plot(w,'g--')


ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_title('Labelling 10 points')
ax.set_xticks(np.arange(-1, 1.1, 0.2))
ax.set_yticks(np.arange(-1, 1.1, 0.2))
ax.set_xlim(-1, 1)
ax.set_ylim(-1, 1)
ax.legend()

What exactly is wrong? Does it break somewhere? It runs but it doesn't classify? The question needs to be precise. Review the [asking](https://stackoverflow.com/help/asking) section and correct you question — gionni, Jul 22 '17 at 15:42
You need to solve an [optimization problem](https://en.wikipedia.org/wiki/Linear_classifier#Discriminative_training). Could you elaborate on how your algorithm works exactly? How do you expect to find an optimal "slope" (and intercept)? Note that `sklearn` has [many](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) [classifiers](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) for such tasks. — a_guest, Jul 22 '17 at 18:03
@a_guest I start with the weight vector `w = [0,0]` and and take the dot-product of w•r, where `r = [x1,y1]` of point1. If the dot-product is positive or zero (e.g. 5) and the label of point1 in the dictionary is +1, I leave it. If however the label is '-1', then this means it should be below the line. Whenever there is a discrepency between the dot-product and label I do: `w = w + (y)r` . Where `y` is the value of the label, i.e. +1 or -1. — Bn.F76, Jul 22 '17 at 19:11
@Bn.F76 As far as I see from your code the update depends on whether the dot product "matches" the label **and** the *value of the dot product* (not the sign of `y`). At least that's part of your `if` clauses. Also for determining if a point is "above" or "below" a line the dot product alone won't help (it indicates the unsigned angle). So you would rather need [`np.arctan2(np.cross(a, b), np.dot(a, b))`](https://stackoverflow.com/a/10145056/3767239) for two vectors `a`, `b`. In addition, how do you know that this algorithm leads to an optimum value for `w`? — a_guest, Jul 22 '17 at 19:24

a_guest · Answer 1 · 2017-07-22T19:28:34.803

You can for example use the SGDClassifier from scikit-learn (sklearn). The linear classifiers compute predictions as follows (see the source code):

def predict(self, X):
        scores = self.decision_function(X)
        if len(scores.shape) == 1:
            indices = (scores > 0).astype(np.int)
        else:
            indices = scores.argmax(axis=1)
    return self.classes_[indices]

where the decision_function is given by:

def decision_function(self, X):
        [...]

        scores = safe_sparse_dot(X, self.coef_.T,
                                 dense_output=True) + self.intercept_
    return scores.ravel() if scores.shape[1] == 1 else scores

So for the two-dimensional case of your example this means that a data point is classified +1 if

x*w1 + y*w2 + i > 0

where

x, y = X
w1, w2 = self.coef_
i = self.intercept_

and -1 otherwise. So the decision depends on x*w1 + y*w2 + i being greater than or less than (or equal to) zero. Thus the "border" is found by setting x*w1 + y*w2 + i == 0. We are free to choose one of the components and the other one is determined by this equation.

The following snippet fits a SGDClassifier and plots the resulting "border". It assumes that the data points are scattered around the origin (x, y = 0, 0), i.e. that their mean is (approximately) zero. Actually, in order to obtain good results, one should first subtract the mean from the data points, then perform the fit and then add the mean back to the result. The following snippet just scatters the points around the origin.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import SGDClassifier

n = 100
x = np.random.uniform(-1, 1, size=(n, 2))

# We assume points are scatter around zero.
b = np.zeros(2)
d = np.random.uniform(-1, 1, size=2)
slope, intercept = (d[1] / d[0]), 0.

fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(x[:, 0], x[:, 1], color = 'black')
ax.plot([b[0], d[0]], [b[1], d[1]], 'b-', label='Ideal')

labels = []
for point in x:
    if(point[1] > (slope * point[0] + intercept)):
        ax.annotate('+', xy=point, xytext=(0, -10), textcoords='offset points', color = 'blue', ha='center', va='center')
        labels.append(1)
    else:
        ax.annotate('--', xy=point, xytext=(0, -10), textcoords='offset points', color = 'red', ha='center', va='center')
        labels.append(-1)

labels = np.array(labels)
classifier = SGDClassifier()
classifier.fit(x, labels)

x1 = np.random.uniform(-1, 1)
x2 = (-classifier.intercept_ - x1 * classifier.coef_[0, 0]) / classifier.coef_[0, 1]

ax.plot([0, x1], [0, x2], 'g--', label='Fit')

plt.legend()
plt.show()

This plot shows the result for n = 100 data points:

The following plot shows the results for different n where the points have been chosen randomly from the pool which contains 1000 data points:

Thank you so much for your prompt and thorough reply! It helped me figure out the answer. — Bn.F76, Jul 23 '17 at 17:13

score 1 · Accepted Answer · answered Jul 23 '17 at 17:23

This is the answer I've come up with. Some notes I realised:
w = w + y(r) algorithm only works for normalised vectors. 'w' is the weight vector, 'r' is [x,y] of the point in question, 'y' is the sign of the label.
You can find the slope and intercept from the resulting vector 'w' by putting the coefficients in ax+by+c = 0 form and solving for 'y'.

w = np.array([0,0,0])
restart = True
while restart:  
    ii = 0
    restart = False
    for x,y in pts10:
        if(restart == False):
            ii += 1

    r = np.array([x,y,1])    
    if (np.dot(w,r) >= 0) and int(label_dict['point{}'.format(ii)][1]) >= 0:
        print "Point " + str(ii) + " is correctly above the line --> no adjustments"      
    elif (np.dot(w,r) < 0) and int(label_dict['point{}'.format(ii)][1]) < 0:
        print "Point " + str(ii) + " is correctly below the line --> no adjustments"        
    elif (np.dot(w,r) >= 0) and int(label_dict['point{}'.format(ii)][1]) < 0:
        print "Point " + str(ii) + " should have been below the line"  
        w = np.subtract(w,r)
        restart = True
        break       
    elif (np.dot(w,r) < 0) and int(label_dict['point{}'.format(ii)][1]) >= 0:
        print "Point " + str(ii) + " should have been above the line"           
        w = np.add(w,r)
        restart = True
        break           
    else:
        print "THERE IS AN ERROR, A POINT PASSED THROUGH HERE"

print w
slope_w = (-w[0])/w[1] 
intercept_w = (-w[2])/w[1]

Why is this Linear Classifier algorithm wrong?

2 Answers2