4

How do you handle graphs like this:

[scatter plot of two classes arranged in concentric circles]

using scikit-learn's LogisticRegression model. Is there a way to handle these sorts of problems easily using scikit-learn and a standard X, y input that maps to a graph like this?

Rob
  • 3,333
  • 5
  • 28
  • 71
  • 2
    LR can only handle linearly separable classes in its raw form. – Kh40tiK Nov 17 '16 at 06:06
  • 1
    I do not think the data provided here is linearly separable – backtrack Nov 17 '16 at 06:10
  • 2
    You can't use logistic regression for this, BUT you can use SVM with a radial (circular) kernel. http://scikit-learn.org/stable/modules/svm.html – Wboy Nov 17 '16 at 06:13
  • Ridge may be useful too. http://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html but I am not sure – backtrack Nov 17 '16 at 06:14
  • If it is not ruled out by "standard X, y input", you could simply add higher-order features to X. Assuming you have x, y coordinates, adding x^2, y^2, x^3, y^3, ... will allow a more complex contour (the more higher-order terms you add, the more complex it gets) – smernst Nov 17 '16 at 11:41
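
A minimal sketch of that last higher-order-feature idea, assuming the data looks roughly like two concentric circles (make_circles is used here only as a stand-in for the plot in the question):

# Sketch of the "add higher-order features" idea from the comment above.
# Squaring the coordinates makes concentric circles linearly separable,
# since x^2 + y^2 is just the squared radius.
from sklearn import datasets, linear_model, model_selection, pipeline, preprocessing

X, y = datasets.make_circles(n_samples=200, factor=.5, noise=.05)

clf = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(degree=2),  # adds x^2, y^2, x*y, ...
    linear_model.LogisticRegression())
print(model_selection.cross_val_score(clf, X, y).mean())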

3 Answers

2

A promising approach, if you really want to use Logistic Regression for this particular setting, would be to transform your coordinates from the Cartesian system to the polar system. From the visualization, it seems that in that system your data will be (almost) linearly separable.

This can be done as described here: Python conversion between coordinates
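
A rough sketch of that idea, assuming the two input columns are Cartesian x, y coordinates (make_circles again stands in for the data in the question):

import numpy as np
from sklearn import datasets, linear_model, model_selection

X, y = datasets.make_circles(n_samples=200, factor=.5, noise=.05)

# Cartesian (x, y) -> polar (r, theta). For concentric circles the radius
# alone separates the classes, so plain LogisticRegression can cope.
r = np.hypot(X[:, 0], X[:, 1])
theta = np.arctan2(X[:, 1], X[:, 0])
X_polar = np.column_stack([r, theta])

clf = linear_model.LogisticRegression()
print(model_selection.cross_val_score(clf, X_polar, y).mean())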

geompalik
  • 1,582
  • 11
  • 22
2

There have been a couple of answers already, but neither of them has mentioned any preprocessing of the data. So I will show both ways of looking at your problem.

First up, I'll look at some manifold learning to transform your data into another space:

# Do some imports that I'll be using
import numpy as np
from sklearn import datasets, manifold, linear_model
from sklearn import model_selection, ensemble, metrics
from matplotlib import pyplot as plt

%matplotlib inline

# Make some data that looks like yours
X, y = datasets.make_circles(n_samples=200, factor=.5,
                             noise=.05)

First of all, let's look at your current problem:

plt.scatter(X[:, 0], X[:, 1], c=y)
clf = linear_model.LogisticRegression()
scores = model_selection.cross_val_score(clf, X, y)
print(scores.mean())

Outputs:

Scatter plot of your data

0.440433749257

So you can see this data looks like yours, and we get a terrible cross-validated accuracy with logistic regression. If you're really attached to logistic regression, what we can do is project your data into a different space using some sort of manifold learning, for example:

Xd = manifold.LocallyLinearEmbedding().fit_transform(X)
plt.scatter(Xd[:, 0], Xd[:, 1], c=y)
clf = linear_model.LogisticRegression()
scores = model_selection.cross_val_score(clf, Xd, y)
print(scores.mean())

Outputs:

[Scatter plot of the transformed data]

1.0

So you can see that your data is now perfectly linearly separable after the LocallyLinearEmbedding, and we get a much better classifier accuracy!

The other option available to you, which has been mentioned by other people, is using a different model. While there are many options available, I'm just going to show an example using RandomForestClassifier. I'm only going to train on half the data so we can evaluate the accuracy on an unbiased set. I only used CV previously because it's quick and easy!

clf = ensemble.RandomForestClassifier().fit(X[:100], y[:100])
print(metrics.accuracy_score(y[100:], clf.predict(X[100:])))

Outputs:

0.97

So we're getting good accuracy! If you're interested in seeing what's going on, we can lift some code from one of the awesome scikit-learn tutorials.

plot_step = 0.02
x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, alpha=0.5)
plt.scatter(X[:, 0], X[:, 1], c=y)

Outputs:

Decision boundary of the RF classifier

So this shows the areas of your space that are being classified into each class using the Random Forest model.

Two ways to solve the same problem. I leave working out which is best as an exercise to the reader...

piman314
  • 5,285
  • 23
  • 35
  • Nice examples! Both answers mentioned preprocessing though: "transform coordinates" and "transform data to make it linearly separable" is preprocessing. – Mikhail Korobov Nov 17 '16 at 20:22
  • Thanks! Yeah I've sort of jumbled a lot of terminology into one answer. I should be more consistent, but not quite sure which is the best terminology to stick with. – piman314 Nov 18 '16 at 10:42
  • Very nice, I like the extra details! Thank you for sharing.. learned some : ) – Yev Guyduy Sep 19 '22 at 18:16
1

As others said, Logistic Regression can't handle this kind of data well because it is a linear classifier. You may transform the data to make it linearly separable, or choose another classifier that is better suited to this kind of data.
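
For example, a kernel SVM (as suggested in the comments above) handles this shape of data without any feature engineering; a minimal sketch, again using make_circles as a stand-in for the data in the question:

from sklearn import datasets, model_selection, svm

X, y = datasets.make_circles(n_samples=200, factor=.5, noise=.05)

# An RBF-kernel SVM learns a non-linear decision boundary, so the
# concentric circles are handled directly, with no manual transformation.
clf = svm.SVC(kernel='rbf')
print(model_selection.cross_val_score(clf, X, y).mean())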

There is a nice visualisation of how various classifiers handle this problem in the scikit-learn docs: see http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html. The second row is for your task:

[Classifier comparison plot from the scikit-learn docs]

Mikhail Korobov
  • 21,908
  • 8
  • 73
  • 65