Determining accuracy for k-means clustering

Question

I want to classify the Iris flower dataset (I removed the labels, so it is now unlabeled data) using sklearn's k-means clustering function. I have built the model and the output seems to group the data correctly for the most part. However, it assigns the cluster labels (0, 1 and 2) arbitrarily, so I cannot compare them to my own labels to determine the accuracy (I have marked setosa as 0, versicolor as 1, virginica as 2). Is there any way to correctly label the flowers?

Here's the code:

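from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

cluster = KMeans(n_clusters=3)
cluster.fit(features)
pred = cluster.labels_  # cluster ids (0, 1, 2), assigned in arbitrary order
score = round(accuracy_score(name_val, pred), 4)  # signature is (y_true, y_pred)
print('Accuracy scored using k-means clustering: ', score)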

features, as expected, contains the features; name_val is an array containing the flower labels: 0 for setosa, 1 for versicolor, 2 for virginica.

Edit: one solution I came up with was setting random_state to a fixed number so that the labeling is constant. Is there any other solution?

k-means is not a classifier. What are you trying to achieve here? — ypnos, Jul 13 '18 at 08:12
Does this answer your question? [sklearn: calculating accuracy score of k-means on the test data set](https://stackoverflow.com/questions/37842165/sklearn-calculating-accuracy-score-of-k-means-on-the-test-data-set) — fuenfundachtzig, Sep 13 '20 at 20:04
I think that this is the measure you need, check the link: https://stackoverflow.com/a/71866136/9862120 — Science Man, Apr 14 '22 at 03:53

score 6 · Accepted Answer · answered Jul 13 '18 at 11:10

You need to look at clustering metrics to evaluate your predictions. These include homogeneity, completeness and V-measure.

Now take Completeness Score for example,

A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.

For example

from sklearn.metrics.cluster import completeness_score
print(completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))
# Output: 1.0

This is similar to what you want. For you the code would be completeness_score(name_val, pred), since the function expects (labels_true, labels_pred). Note that the particular label assigned to a data point is not important; what matters is how the points are labelled relative to each other.

Homogeneity, on the other hand, is satisfied when every cluster contains only data points that are members of a single class. The V-measure is the harmonic mean of the two, defined as 2 * (homogeneity * completeness) / (homogeneity + completeness).

Read the official documentation here: Homogeneity, completeness and V-measure
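As a minimal sketch of how the three scores behave, the snippet below uses hypothetical toy labels (not the Iris data): one clustering that is perfect up to a permutation of the cluster ids, and one with a single misplaced point. The variable names are assumptions made for illustration.

```python
from sklearn.metrics import (
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

# Hypothetical toy labels, chosen only for illustration.
labels_true = [0, 0, 1, 1, 2, 2]

# Same grouping as labels_true, but the cluster ids are permuted.
labels_permuted = [2, 2, 0, 0, 1, 1]
# One point of class 1 ends up in the cluster that holds class 0.
labels_imperfect = [2, 2, 2, 0, 1, 1]

# All three scores ignore the actual id values, so the permuted case is perfect.
h_perm = homogeneity_score(labels_true, labels_permuted)
c_perm = completeness_score(labels_true, labels_permuted)
v_perm = v_measure_score(labels_true, labels_permuted)
print(h_perm, c_perm, v_perm)

# For the imperfect clustering the scores drop below 1, and V-measure
# is the harmonic mean of homogeneity and completeness.
h = homogeneity_score(labels_true, labels_imperfect)
c = completeness_score(labels_true, labels_imperfect)
v = v_measure_score(labels_true, labels_imperfect)
print(h, c, v, 2 * h * c / (h + c))
```

With the question's variables, the equivalent call would be v_measure_score(name_val, pred).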

score 4 · Answer 2 · answered Jul 13 '18 at 08:29

First of all, you are not classifying, you are clustering the data. Classification is a different process.

The K-Means algorithm includes randomness in choosing the initial cluster centers. By setting the random_state you manage to reproduce the same clustering, as the initial cluster centers will be the same. However, this does not fix your problem. What you want is the cluster with id 0 to be setosa, 1 to be versicolor, etc. This is not possible because the K-Means algorithm has no knowledge of these categories; it only groups flowers depending on their similarity. What you can do is create a rule to determine which cluster corresponds to which category. For example, you can say that if more than 50% of the flowers that belong to a cluster are also in the setosa category, then this cluster's flowers should be compared to the set of flowers in the setosa category.
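The rule described above could be sketched roughly as follows. This is only an illustration: the helper name map_clusters_by_majority and the toy arrays are hypothetical, and it uses the most frequent class in each cluster rather than a strict 50% threshold.

```python
import numpy as np

def map_clusters_by_majority(y_true, y_pred):
    """Relabel every cluster with the most frequent true class among its members."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    mapped = np.empty_like(y_true)
    for cluster_id in np.unique(y_pred):
        members = y_pred == cluster_id
        classes, counts = np.unique(y_true[members], return_counts=True)
        mapped[members] = classes[np.argmax(counts)]
    return mapped

# Hypothetical toy data: the true classes and arbitrary k-means cluster ids.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

mapped = map_clusters_by_majority(y_true, y_pred)
accuracy = np.mean(mapped == y_true)
print(mapped, accuracy)
```

Note that with this rule two clusters can end up mapped to the same category, so it does not guarantee a one-to-one correspondence.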

That's the best way of doing it that I can think of. However, this is not how clustering quality is usually evaluated; there are metrics for that, such as the Silhouette Coefficient. I hope this helps.
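For reference, a minimal sketch of computing the Silhouette Coefficient, which needs no true labels at all. It assumes the features are the standard Iris measurements loaded through load_iris, and the n_init and random_state values are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Assumption: the features are the four Iris measurements shipped with sklearn.
features = load_iris().data

cluster = KMeans(n_clusters=3, n_init=10, random_state=0)
pred = cluster.fit_predict(features)

# Ranges from -1 to 1; higher means tighter, better separated clusters.
score = silhouette_score(features, pred)
print('Silhouette Coefficient:', round(score, 4))
```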

score 0 · Answer 3 · answered Jun 02 '21 at 09:12

Reference from this blog: https://smorbieu.gitlab.io/accuracy-from-classification-to-clustering-evaluation/ You need to derive the cluster-to-class mapping from the confusion matrix with the Hungarian algorithm. The code is below:

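import numpy as np
from scipy.optimize import linear_sum_assignment as linear_assignment
from sklearn import metrics

def cluster_acc(y_true, y_pred):
    cm = metrics.confusion_matrix(y_true, y_pred)
    # Turn counts into costs so that minimizing cost maximizes matched counts
    _make_cost_m = lambda x: -x + np.max(x)
    # Hungarian algorithm: returns (row indices, column indices)
    indexes = linear_assignment(_make_cost_m(cm))
    indexes = np.concatenate([indexes[0][:, np.newaxis], indexes[1][:, np.newaxis]], axis=-1)
    # Column order that puts each matched cluster on the diagonal
    js = [e[1] for e in sorted(indexes, key=lambda x: x[0])]
    cm2 = cm[:, js]
    acc = np.trace(cm2) / np.sum(cm2)
    return acc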

Or just use the coclust library:

from coclust.evaluation.external import accuracy
accuracy(labels, predicted_labels)
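If you want the relabeled predictions themselves rather than only the accuracy, the same Hungarian matching can be used to build an explicit cluster-to-class mapping. This is a sketch under assumptions: the helper name relabel_clusters and the toy arrays are hypothetical, and the maximize argument requires SciPy 1.4 or later.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def relabel_clusters(y_true, y_pred):
    """Rename cluster ids to class ids using a one-to-one Hungarian matching."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    # Rows are true classes, columns are cluster ids.
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    # Pick the pairing of classes and clusters with the largest total overlap.
    rows, cols = linear_sum_assignment(cm, maximize=True)
    mapping = {labels[c]: labels[r] for r, c in zip(rows, cols)}
    return np.array([mapping[p] for p in y_pred])

# Hypothetical toy data: the true classes and arbitrary k-means cluster ids.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 0, 0, 0, 0])

relabeled = relabel_clusters(y_true, y_pred)
acc = np.mean(relabeled == y_true)
print(relabeled, acc)
```

The relabeled array can then be passed to accuracy_score together with the true labels.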
