1

I have a set of points that I have clustered using a clustering algorithm (k-means in this case). I also know the ground-truth labels, and I want to measure how accurate my clustering is. What I need is the actual accuracy. The problem, of course, is that the labels given by the clustering do not match the ordering of the original ones.

Is there a way to measure this accuracy? The intuitive idea would be to compute, from the confusion matrix, the accuracy for every possible assignment of cluster labels to true labels, and keep only the maximum. Is there a function that does this?
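Something along these lines is what I had in mind, using scipy.optimize.linear_sum_assignment to find the best matching between cluster labels and true labels (the helper below is just a sketch of the idea; I do not know whether there is a ready-made function for this):

from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def clustering_accuracy(y_true, y_pred):
    # hypothetical helper, just to illustrate the idea
    # rows = true labels, columns = cluster labels
    cm = confusion_matrix(y_true, y_pred)
    # find the cluster-to-label assignment that maximizes the diagonal sum
    row_ind, col_ind = linear_sum_assignment(-cm)
    # accuracy under the best possible matching
    return cm[row_ind, col_ind].sum() / cm.sum()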

I have also evaluated my results using the Rand score and the adjusted Rand score. How close are these two measures to the actual accuracy?

Thanks!

Alfred

3 Answers

2

First of all, what does "The problem, of course, is that the labels given by the clustering do not match the ordering of the original ones" mean?

If you know the ground-truth labels, then you can re-arrange them to match the order of the X matrix, and in that way the KMeans labels will be in accordance with the true labels after prediction.


In this situation, I suggest the following.

  • If you have the ground-truth labels and you want to see how accurate your model is, then you need metrics such as the Rand index or mutual information between the predicted and true labels; see the sketch right after this bullet. You can compute these within a cross-validation scheme and check whether the model predicts the classes/labels correctly.
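For instance, a minimal sketch of computing both metrics directly between the K-means assignments and the true labels (using the iris dataset just as a stand-in, without any cross-validation):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

X, y = load_iris(return_X_y=True)

# cluster the full dataset and compare the assignments with the true labels
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

print(adjusted_rand_score(y, labels))         # invariant to how the clusters are numbered
print(adjusted_mutual_info_score(y, labels))  # likewise permutation-invariant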

In summary:

  • Define a KMeans model and use cross-validation; in each iteration, estimate the Rand index (or mutual information) between the cluster assignments and the true labels. Finally, take the mean of the Rand index scores over all iterations. If this score is high, then the model is good.

Full example:

from sklearn.cluster import KMeans
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
import numpy as np

# some data
data = load_iris()
X = data.data
y = data.target # ground truth labels
loo = LeaveOneOut()

rand_index_scores = []
for train_index, test_index in loo.split(X):  # LOOCV here
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # the model
    kmeans = KMeans(n_clusters=3, random_state=0)
    kmeans.fit(X_train)  # fit using training data
    predicted_labels = kmeans.predict(X_test)  # predict using test data
    rand_index_scores.append(adjusted_rand_score(y_test, predicted_labels))  # goodness of the predicted labels

print(np.mean(rand_index_scores))
seralouk
  • Thanks! I still have a couple of questions: Why do I need the original data (X in your case)? I have tried it on my dataset, and the accuracy I get is way worse than random guessing, which is definitely wrong. Finally, why do I get three different answers when I have 10 clusters? Thank you again! – Alfred Dec 16 '19 at 18:30
1

Since clustering is an unsupervised learning problem, you have specific metrics for it: https://scikit-learn.org/stable/modules/classes.html#clustering-metrics

You can refer to the discussion in the scikit-learn User Guide to have an idea of the differences between the different metrics for clustering: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

For instance, the adjusted Rand index compares pairs of points and checks that points sharing a label in the ground truth also share a label in the predictions. Unlike accuracy, it does not require strict label equality, so it does not depend on how the clusters are numbered.
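For example, a small sketch to illustrate the point: a prediction that uses different cluster numbers for the same grouping gets zero accuracy but a perfect adjusted Rand index:

from sklearn.metrics import accuracy_score, adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 2, 2, 0, 0]  # same grouping, different cluster numbers

print(accuracy_score(y_true, y_pred))       # 0.0, the labels never match literally
print(adjusted_rand_score(y_true, y_pred))  # 1.0, the partition is identical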

glemaitre
-1

You can use sklearn.metrics.accuracy_score, as documented in the link below:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

An example can be seen in the link below:

sklearn: calculating accuracy score of k-means on the test data set
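A minimal usage sketch (keep in mind that accuracy_score only makes sense here if the cluster labels have already been mapped onto the ground-truth labels, e.g. as discussed in the other answers):

from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 2, 0, 0]  # cluster labels already aligned with the true labels

print(accuracy_score(y_true, y_pred))  # fraction of exactly matching labels, here 5/6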