Cannot plot K-Means clusters for one-dimensional data

Question

I am trying to apply K-Means to a binary classification task, but I cannot plot a scatter graph of the resulting two clusters.

My dataset is simply in the following form:

# size, class
  312,  1
  319,  1
  227,  0

The minimal example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.cluster         import KMeans

X = {'size': [312,319,227,301,273,311,277,291,303,381], 'class': [1,1,0,1,0,1,0,0,1,1]}
X = pd.DataFrame(data=X)
X_train, X_test, y_train, y_test = train_test_split(X['size'], X['class'], test_size=0.4)
X_train = X_train.values.reshape(-1,1)
X_test  = X_test.values.reshape(-1,1)

kmeans = KMeans(init="k-means++", n_clusters=2, n_init=10, max_iter=300, random_state=42)

kmeans.fit(X_train)
preds = kmeans.predict(X_test)

How can I plot a scatter plot that shows the two clusters, the samples in "X_test" and corresponding colors (for 0 and 1) according to the predictions "preds"?

Is there a specific error you are facing? Can you please add some details on the issue? Also, why are you splitting the df into X and y? K-Means is an unsupervised method and normally does not have a target value (y), unlike supervised learning models. – heretolearn, Jul 07 '21 at 12:09
Thanks for the answer @heretolearn. I know that it is an unsupervised method, but I just want to see if I can classify the data based on the "size" feature, and I want to evaluate how successful the clustering was by comparing against the true labels. I get an error about the data shape. – bbasaran, Jul 07 '21 at 12:40
Would [this approach](https://stackoverflow.com/a/66331929/2912349) solve your problem? – Paul Brodersen, Jul 07 '21 at 13:30

Pierre-Loic · Accepted Answer · 2021-07-07T13:35:28.943


As you only have one feature, all your data lies on a line. You can create your scatter plot like this:

import matplotlib.pyplot as plt

color = ["blue", "red"]  # one color per cluster id (0 and 1)
plt.scatter(X_test.flatten(), [0]*len(X_test), c=[color[p] for p in preds])
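
A slightly fuller sketch of the same idea, with the two centroids and the split point drawn on the line. The helper name `plot_1d_clusters`, the fixed 6/4 split, and the `Agg` backend line are assumptions made for illustration, not part of the original code. With two centroids on one axis, each point is assigned to the nearer centroid, so the boundary is the midpoint between them.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption); remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans


def plot_1d_clusters(X_train, X_test):
    """Fit 2-cluster K-Means on a single feature and draw the test points on a line."""
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)
    preds = kmeans.predict(X_test)
    centers = kmeans.cluster_centers_.ravel()
    boundary = centers.mean()  # midpoint between the two 1-D centroids

    color = ["blue", "red"]
    fig, ax = plt.subplots(figsize=(6, 2))
    # Test samples, all at y = 0, colored by predicted cluster
    ax.scatter(X_test.ravel(), np.zeros(len(X_test)), c=[color[p] for p in preds])
    # Centroids as crosses, in the color of their own cluster
    ax.scatter(centers, np.zeros(2), c=color, marker="x", s=100)
    ax.axvline(boundary, color="gray", linestyle="--")
    ax.set_yticks([])
    ax.set_xlabel("size")
    return fig, preds, boundary


# Same "size" values as in the question; the split here is fixed, not random
sizes = np.array([312, 319, 227, 301, 273, 311, 277, 291, 303, 381]).reshape(-1, 1)
fig, preds, boundary = plot_1d_clusters(sizes[:6], sizes[6:])
```

The plot is still a line of points. It cannot look like a two-dimensional cluster picture, because there is only one coordinate to draw.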

If you want to have two features, you can modify your data:

X = {
    'size_1': [312,319,227,301,273,311,277,291,303,381],
    'size_2': [152,165,301,145,310,145,315,156,160,165],
    'class': [1,1,0,1,0,1,0,0,1,1],
}
X = pd.DataFrame(data=X)
X_train, X_test, y_train, y_test = train_test_split(X[['size_1', 'size_2']], X['class'], test_size=0.4)

Then, after refitting K-Means on the new X_train and recomputing preds for the new X_test, you modify the scatter plot:

plt.scatter(X_test.iloc[:, 0], X_test.iloc[:, 1], c=[color[p] for p in preds])
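
Putting the two-feature pieces together, here is a minimal end-to-end sketch. It assumes the model is refit on the two-column training data; `random_state=0` for the split, the `Agg` backend line, and the `agreement` variable are illustrative choices, not from the original. Since cluster ids are arbitrary, the comparison with the true labels scores both possible labelings and keeps the better one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption); remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "size_1": [312, 319, 227, 301, 273, 311, 277, 291, 303, 381],
    "size_2": [152, 165, 301, 145, 310, 145, 315, 156, 160, 165],
    "class": [1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["size_1", "size_2"]], df["class"], test_size=0.4, random_state=0
)

# Refit on the two-feature training data, then predict clusters for the test data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train.to_numpy())
preds = kmeans.predict(X_test.to_numpy())
centers = kmeans.cluster_centers_

color = ["blue", "red"]
fig, ax = plt.subplots()
ax.scatter(X_test["size_1"], X_test["size_2"], c=[color[p] for p in preds])
ax.scatter(centers[:, 0], centers[:, 1], c=color, marker="x", s=100)  # centroids
ax.set_xlabel("size_1")
ax.set_ylabel("size_2")

# Cluster 0 need not correspond to class 0, so check both assignments
agreement = np.mean(preds == y_test.to_numpy())
agreement = max(agreement, 1 - agreement)
```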

edited Jul 07 '21 at 13:35

answered Jul 07 '21 at 12:59

Pierre-Loic


Thanks for your input @Pierre-Loic. I tried, but it gives a weird result on my actual dataset: https://i.postimg.cc/KzMTZTkb/cluster.png – bbasaran Jul 07 '21 at 13:03
Why is it not OK? What do you expect? – Pierre-Loic Jul 07 '21 at 13:08
I expect to see something like this https://i.postimg.cc/ZY3TLHzZ/cluster.png – bbasaran Jul 07 '21 at 13:18
To have something like your picture you need 2 features (vertical and horizontal axis). You only have one in your data (your "size" column) – Pierre-Loic Jul 07 '21 at 13:21
I got it. If I add another feature to my dataset, how should I adjust your code to get a graph like the one I showed? – bbasaran Jul 07 '21 at 13:26
