I'm trying to cluster a bunch of 34-element vectors (~200,000 of them) using sklearn.cluster.MiniBatchKMeans and assess the results using sklearn.metrics.silhouette_score; this is the subject of the question How to use silhouette score in k-means clustering from sklearn library?, but with the following difference: in that question the data came from a DataFrame, whereas the vectors in my problem are the rows of a SciPy CSR matrix. Here's an excerpt of the code:
import sys

import scipy.sparse as sp
import sklearn.metrics
import sklearn as skl
from sklearn.cluster import MiniBatchKMeans

sparse_vectors = sp.coo_matrix(...).tocsr()
n_clusters = 256
kmeans_clusterer = MiniBatchKMeans(n_clusters=n_clusters)
cluster_labels = kmeans_clusterer.fit_predict(sparse_vectors)
print("done with KMeans, cluster_labels.shape", cluster_labels.shape, file=sys.stderr)
clustering_score = skl.metrics.silhouette_score(sparse_vectors, cluster_labels, metric="euclidean")
print("n_clusters", n_clusters, "clustering_score", clustering_score, file=sys.stderr)
What I observe from the first print statement is that the clustering takes under a second, whereas I have not yet had enough patience to wait for the second print statement to appear. (As far as I understand, silhouette_score computes all pairwise distances, so its cost grows quadratically with the number of samples, i.e. on the order of 200,000² distance evaluations here.)
Is there something I can do to speed up the call to silhouette_score, or alternatively, is there a faster way to evaluate the quality of the clustering, so that I can decide what value to use for n_clusters?
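For context, here is a minimal sketch of two options I'm aware of (on synthetic data, since my real matrix isn't reproducible here): silhouette_score's sample_size parameter, which scores only a random subsample of rows, and the Calinski-Harabasz index, which scales linearly with the number of samples. The sizes and n_clusters below are placeholders, not my real values:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Stand-in for the real data: a random sparse matrix with 34 columns.
sparse_vectors = sp.random(5000, 34, density=0.2, format="csr", random_state=0)
labels = MiniBatchKMeans(n_clusters=8, n_init=3, random_state=0).fit_predict(sparse_vectors)

# Option 1: score only a random subsample of rows via `sample_size`,
# turning the quadratic cost into sample_size**2.
score = silhouette_score(sparse_vectors, labels, metric="euclidean",
                         sample_size=1000, random_state=0)

# Option 2: an O(n) index such as Calinski-Harabasz. It needs a dense
# array; at 200,000 x 34 float64 that is only ~55 MB.
ch = calinski_harabasz_score(sparse_vectors.toarray(), labels)
```

Both give a single number per n_clusters, so either could in principle drive the model-selection loop; I'm unsure how they compare in practice.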