
I'm trying to cluster about 200,000 34-element vectors using scikit-learn's k-means (MiniBatchKMeans in the code below) and assess the results using sklearn.metrics.silhouette_score. This is the subject of the question "How to use silhouette score in k-means clustering from sklearn library?", but with one difference: there the data came from a DataFrame, whereas my vectors are the rows of a CSR matrix. Here's an excerpt of the code:

import sys
import scipy.sparse as sp
import sklearn.metrics
from sklearn.cluster import MiniBatchKMeans

sparse_vectors = sp.coo_matrix(...).tocsr()  # construction elided; ~200,000 rows, 34 columns
n_clusters = 256
kmeans_clusterer = MiniBatchKMeans(n_clusters=n_clusters)
cluster_labels = kmeans_clusterer.fit_predict(sparse_vectors)
print("done with KMeans, cluster_labels.shape", cluster_labels.shape, file=sys.stderr)
clustering_score = sklearn.metrics.silhouette_score(sparse_vectors, cluster_labels, metric="euclidean")
print("n_clusters", n_clusters, "clustering_score", clustering_score, file=sys.stderr)

The first print statement shows that the clustering itself takes under a second; I have not yet had enough patience to wait for the second print statement to appear.

Is there something I can do to speed up the call to silhouette_score? Alternatively, is there a faster way to evaluate the quality of the clustering, so that I can decide what value to use for n_clusters?

Mark Lavin
  • I think the answer to "is there a faster way?" is *yes*. I implemented a very simple Python function to compute the RMS average intracluster distance (the distance from each vector to the center of the cluster it belongs to), and it runs in a few seconds; see the first sketch below. – Mark Lavin Nov 30 '19 at 18:56
  • `silhouette_score`'s `kwds` get passed to the distance function, so pass `n_jobs=-1` to speed it up; see the second sketch below. – Matt Eding Nov 30 '19 at 19:22
  • @MarkLavin could you post your alternative? – Anonymous Jul 07 '20 at 10:22
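Mark Lavin's function was never posted, but a minimal sketch of what an RMS intracluster-distance computation might look like is below; the function name and exact formulation are my assumptions, with sparse_vectors, cluster_labels, and kmeans_clusterer as in the question:

import numpy as np

def rms_intracluster_distance(X, labels, centers):
    # Assumed implementation of the RMS average intracluster distance.
    # Residual of each sample from its assigned cluster center;
    # sparse minus dense gives a dense result, which is manageable here
    # since the matrix is only n_samples x 34.
    diffs = np.asarray(X - centers[labels])
    # RMS of the Euclidean distances to the assigned centers
    return np.sqrt((diffs ** 2).sum(axis=1).mean())

score = rms_intracluster_distance(sparse_vectors, cluster_labels,
                                  kmeans_clusterer.cluster_centers_)

Note that for the full-batch KMeans estimator this quantity is exactly sqrt(inertia_ / n_samples), since inertia_ is the sum of squared distances of samples to their nearest center; MiniBatchKMeans also exposes inertia_, though its value may be an approximation.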
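Matt Eding's n_jobs suggestion can be combined with silhouette_score's documented sample_size parameter; subsampling is usually the bigger win, since the silhouette computation scales quadratically with the number of samples. A short sketch (the sample_size value of 10,000 is an arbitrary illustration):

from sklearn.metrics import silhouette_score

# Extra keyword arguments are forwarded to the pairwise-distance
# computation, so n_jobs=-1 parallelises it across all cores.
score_parallel = silhouette_score(sparse_vectors, cluster_labels,
                                  metric="euclidean", n_jobs=-1)

# Estimate the score on a random subsample instead of all ~200,000
# rows; sample_size and random_state are standard parameters.
score_sampled = silhouette_score(sparse_vectors, cluster_labels,
                                 metric="euclidean", sample_size=10000,
                                 random_state=0)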

0 Answers