I am trying to apply the K-Means algorithm to my binary classification task, but I cannot plot a scatter graph of the resulting two clusters.

My dataset is simply in the following form:

# size, class
  312,  1
  319,  1
  227,  0

The minimal example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

X = {'size': [312,319,227,301,273,311,277,291,303,381], 'class': [1,1,0,1,0,1,0,0,1,1]}
X = pd.DataFrame(data=X)
X_train, X_test, y_train, y_test = train_test_split(X['size'], X['class'], test_size=0.4)
X_train = X_train.values.reshape(-1,1)
X_test  = X_test.values.reshape(-1,1)

kmeans = KMeans(init="k-means++", n_clusters=2, n_init=10, max_iter=300, random_state=42)

kmeans.fit(X_train)
preds = kmeans.predict(X_test)

How can I plot a scatter plot that shows the two clusters, the samples in "X_test" and corresponding colors (for 0 and 1) according to the predictions "preds"?

desertnaut
bbasaran
  • Is there any specific error you are facing? Can you please add some details on the issue? Also, why are you splitting the df into X and y? K-means is an unsupervised learning method and normally does not have a target value (y), unlike supervised learning models. – heretolearn Jul 07 '21 at 12:09
  • Thanks for the answer @heretolearn. I know that it is an unsupervised method, but I just want to see if I can classify the data based on the "size" feature, and I want to evaluate how successful the clustering was, by comparing the true labels. I face an error about the data shape. – bbasaran Jul 07 '21 at 12:40
  • Would [this approach](https://stackoverflow.com/a/66331929/2912349) solve your problem? – Paul Brodersen Jul 07 '21 at 13:30

1 Answer

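As you only have one feature, all your data lies on a line. You can create your scatter plot by placing every test sample at y = 0 and coloring it by its predicted cluster:

import matplotlib.pyplot as plt

color = ["blue", "red"]  # one color per cluster index (0 and 1)
plt.scatter(X_test.flatten(), [0]*len(X_test), c=[color[p] for p in preds])
plt.show()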
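
For reference, here is a minimal end-to-end sketch of the one-feature case, combining the code from the question with the plot above. The `random_state=0` on the split, the cluster-center markers, the `Agg` backend, and the output file name `clusters_1d.png` are my own assumptions, added only to make the run reproducible and display-free. Note that K-Means cluster indices are arbitrary, so cluster 0 does not necessarily correspond to class 0.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove to open a window
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Same data as in the question
data = pd.DataFrame({
    'size': [312, 319, 227, 301, 273, 311, 277, 291, 303, 381],
    'class': [1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
})

# random_state is an assumption, added so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    data['size'], data['class'], test_size=0.4, random_state=0)
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)

kmeans = KMeans(init="k-means++", n_clusters=2, n_init=10,
                max_iter=300, random_state=42)
kmeans.fit(X_train)
preds = kmeans.predict(X_test)

# One color per cluster index; every point sits at y = 0
color = ["blue", "red"]
point_colors = [color[p] for p in preds]

fig, ax = plt.subplots()
ax.scatter(X_test.flatten(), [0] * len(X_test), c=point_colors)

# Mark the learned cluster centers in the matching colors
centers = kmeans.cluster_centers_.flatten()
ax.scatter(centers, [0] * len(centers), c=color, marker="x", s=100)

ax.set_xlabel("size")
ax.set_yticks([])  # the y axis carries no information here
fig.savefig("clusters_1d.png")  # hypothetical output file name
```
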

If you want to have two features, you can modify your data:

X = {
    'size_1': [312,319,227,301,273,311,277,291,303,381],
    'size_2': [152,165,301,145,310,145,315,156,160,165],
    'class': [1,1,0,1,0,1,0,0,1,1],
}
X = pd.DataFrame(data=X)
X_train, X_test, y_train, y_test = train_test_split(X[['size_1', 'size_2']], X['class'], test_size=0.4)

After refitting kmeans on the new X_train and recomputing preds on the new X_test, modify the scatter plot to use both columns:

plt.scatter(X_test.iloc[:,0],X_test.iloc[:,1], c=[color[p] for p in preds])
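
Putting the two-feature variant together, a self-contained sketch could look like the following. As before, the `random_state=0` on the split, the cluster-center markers, the `Agg` backend, and the file name `clusters_2d.png` are assumptions of mine and not part of the original answer. The model is fitted again here because the clusters learned on the single `size` column cannot be reused for two-column data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove to open a window
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Two-feature data from the answer
data = pd.DataFrame({
    'size_1': [312, 319, 227, 301, 273, 311, 277, 291, 303, 381],
    'size_2': [152, 165, 301, 145, 310, 145, 315, 156, 160, 165],
    'class': [1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
})

# random_state is an assumption, added so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    data[['size_1', 'size_2']], data['class'], test_size=0.4, random_state=0)

# Fit on the two-column training data, then predict the test clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X_train)
preds = kmeans.predict(X_test)

color = ["blue", "red"]
fig, ax = plt.subplots()
ax.scatter(X_test.iloc[:, 0], X_test.iloc[:, 1],
           c=[color[p] for p in preds])

# Cluster centers, drawn as crosses in the matching cluster color
centers = kmeans.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], c=color, marker="x", s=100)

ax.set_xlabel("size_1")
ax.set_ylabel("size_2")
fig.savefig("clusters_2d.png")  # hypothetical output file name
```
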
Pierre-Loic