sklearn KMeans.predict method doesn't work correctly

Question

I used sklearn to implement k-means. The KMeans class has a method called "predict" that assigns new samples to clusters based on the trained model.

from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
'''
make sample
'''
X, y=make_blobs(n_samples=100, n_features=2, centers=3)

'''
kmeans
'''
kmeans_obj=KMeans(n_clusters=3)

#train
kmeans_obj.fit(X)

#labels:
labels=kmeans_obj.predict(X)


'''
output
'''
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()

'''
generate new samples and predict them
'''
while True:
    '''
    predict kmeans?!?!?!?
    '''
    new_X, new_y=make_blobs(n_samples=50, n_features=2, centers=4)

    predicted_new_sample_labels=kmeans_obj.predict(new_X)

    plt.scatter(X[:,0], X[:,1], c=labels)
    plt.scatter(new_X[:,0], new_X[:,1], c=predicted_new_sample_labels, marker="x")
    plt.show()

Sometimes it works fine:

But sometimes it doesn't:

The circles in the pictures are the training data, and the crosses are the new samples that were predicted.

The problem here is not about whether the result is deterministic. With a nondeterministic algorithm the output changes on every run, but the result here is completely wrong! In picture 2, the violet crosses should be green.

The problem is in your `while True` loop. Inside it you generate test data with `make_blobs()`. Each call to `make_blobs` draws new random cluster centers, so the new data will not match the training data and the predicted clusters will not line up with the blobs you see. You need to generate all the data first and then divide it into train and test sets. Hope this makes sense. You should also brush up on your clustering basics. – Vivek Kumar, May 20 '17 at 07:45
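The suggestion in this comment could be sketched roughly as follows. This is only an illustration, not the commenter's own code: the sample counts, the `random_state=0` seeds, and the explicit `n_init=10` are arbitrary choices made here, and `train_test_split` is just one convenient way to hold out part of a single dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Draw ONE dataset so train and test points share the same blob centers.
# random_state=0 is an arbitrary seed chosen for this sketch.
X, y = make_blobs(n_samples=150, n_features=2, centers=3, random_state=0)

# Hold out 50 points to play the role of the "new" samples.
X_train, X_test = train_test_split(X, test_size=50, random_state=0)

kmeans_obj = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_obj.fit(X_train)

# Each held-out point gets the index of its nearest fitted centroid.
test_labels = kmeans_obj.predict(X_test)
```

Plotting `X_train` with `kmeans_obj.labels_` as circles and `X_test` with `test_labels` as crosses, as in the question, should then show the crosses falling inside the matching blobs, because both sets come from the same three centers.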

score 2 · Answer 1 · edited May 23 '17 at 12:26


K-means is not a deterministic algorithm: the cluster assignment depends on the distribution of the data and on the random initialization of the centroids. You can counter this by fixing a seed, either globally with NumPy's np.random.seed() (scikit-learn draws from NumPy's global random state when random_state is None) or, more directly, with the random_state parameter of KMeans. Please refer to the following pages for more on this:
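As a minimal sketch of the `random_state` approach, assuming synthetic data from `make_blobs` and arbitrary seed values (42 and 0 are placeholders chosen here, as is `n_init=10`), two separate fits with the same seed can be compared like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Fixed toy data; the seed 42 is an arbitrary choice for this sketch.
X, _ = make_blobs(n_samples=100, n_features=2, centers=3, random_state=42)

# Two independent estimators with the same random_state start from the
# same centroid initialization, so their labels should be identical.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

same_labels = np.array_equal(labels_a, labels_b)
```

Note that this only makes the fit repeatable from run to run. It does not change how `predict` treats new points drawn from a different distribution.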


answered May 20 '17 at 04:03

manojps


The problem here is not about whether the result is deterministic. With a nondeterministic algorithm the output changes on every run, but the result here is completely wrong! In picture 2, the violet crosses should be green. – pd shah May 20 '17 at 04:13
Sorry, I didn't notice the notes at the end of your question before. However, after closer inspection I think this is not an error in the predictions but an issue with labeling: the labels are not consistent between the training set and the test set. This is very common in Matlab's k-means implementation, though I have never encountered it in scikit-learn before. You could take a look at scikit-learn's LabelEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). – manojps May 20 '17 at 06:54
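One way to test the labeling hypothesis is to compare `predict` against a manual nearest-centroid assignment. `KMeans.predict` is documented as returning the closest cluster for each sample, so the label numbering is fixed when the model is fitted. The sketch below assumes synthetic data with arbitrary seeds, and the hand-written Euclidean distance computation is an illustration written for this check, not scikit-learn's internal code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Training data and fitted model; seeds are arbitrary for this sketch.
X, _ = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
kmeans_obj = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# New points drawn from a DIFFERENT set of blobs, as in the question.
new_X, _ = make_blobs(n_samples=50, n_features=2, centers=4, random_state=1)

# Euclidean distance from every new point to every fitted centroid,
# giving an array of shape (50, 3).
dists = np.linalg.norm(
    new_X[:, None, :] - kmeans_obj.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)

# predict() should agree with the manual nearest-centroid assignment.
agree = np.array_equal(nearest, kmeans_obj.predict(new_X))
```

If `agree` is true, the labels for new points use the same numbering as the training labels, and a cross that looks miscolored is simply closer to a different centroid than the blob it appears to sit next to.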
