
I am using scikit-learn and experimenting with KMeans. It's fast, but it requires the number of clusters as an argument. What I would like is to automatically compute the number of clusters based on the population of documents.

Hash-based near-neighbor algorithms (ssdeep) that I used before can produce similarity clusters based on distance. How can I get the cluster count automatically for k-means?

km = KMeans(init='k-means++', n_clusters=cluster_count, n_init=10)
km.fit(data)

I want to calculate that cluster_count automatically; is that possible? My test dataset is a collection of random files from 20_newsgroups, not pre-categorized into folders (everything in a single folder), so there are no labels.
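One common approach is to scan a range of candidate k values and keep the one with the best internal evaluation metric, such as the mean silhouette score. A minimal sketch (the random `data` array is a stand-in for your vectorized documents, and the k range is an assumed choice you would tune):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
data = rng.rand(200, 10)  # placeholder for your TF-IDF document vectors

best_k, best_score = None, -1.0
for k in range(2, 10):  # candidate cluster counts to try
    km = KMeans(init='k-means++', n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(data)
    score = silhouette_score(data, labels)  # higher is better, in [-1, 1]
    if score > best_score:
        best_k, best_score = k, score

cluster_count = best_k  # feed this into the final KMeans fit
```

This costs one KMeans run per candidate k, so it stays fast as long as the scan range is modest.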

Phyo Arkar Lwin
  • You can try various values of k, then pick the best clustering by some evaluation metric (see `sklearn.metrics`). – Fred Foo Dec 03 '12 at 13:23
  • From the clustering documentation, I guess `4.3.3. Affinity propagation` is what I'm looking for, but it won't be as fast as KMeans, right? Does k-means support something like the cluster-count guessing in affinity propagation? – Phyo Arkar Lwin Dec 03 '12 at 14:12
  • I tested Affinity Propagation on selected docs of the 20_newsgroups dataset (it has 19,095 documents) and it eats up all RAM (6 GB out of 8 GB, plus 5 GB of swap). So I guess it is useless for big datasets. What do you recommend? DBSCAN? – Phyo Arkar Lwin Dec 03 '12 at 14:49
  • DBSCAN might work, though it doesn't scale very well to large numbers of samples because of its O(n²) complexity. (It could have been O(n lg n) with a smarter algorithm, but we never implemented that.) – Fred Foo Dec 03 '12 at 14:55
  • I see; is there any plan to implement an O(n log n) version? – Phyo Arkar Lwin Dec 03 '12 at 19:19
  • Not that I'm aware of. If you want to try your hand at it, be my guest. – Fred Foo Dec 03 '12 at 20:32
  • I think mean shift could also work, though I'm not sure how well it scales. – Andreas Mueller Dec 04 '12 at 09:08
  • I think you may want to try BIRCH algorithm. – Naruil May 27 '13 at 04:39
  • +1 to @FredFoo. I was just trying to solve a similar problem and ended up using silhouette plots and storing the average `silhouette_score` for a hand-picked range of values for `n_clusters`. Then fed the best value into my final `KMeans` function. It only took a few minutes to code and when I verified some sets by hand it looks like it made reasonable choices for each data set. – ZSH May 25 '16 at 18:47
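As the comments suggest, DBSCAN sidesteps the problem entirely by inferring the number of clusters from density rather than taking k as input. A hedged sketch on toy data (the `eps` and `min_samples` values are assumptions you would tune for your document vectors, not recommended settings):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
# two tight blobs plus a handful of scattered noise points
blob1 = rng.randn(50, 2) * 0.1
blob2 = rng.randn(50, 2) * 0.1 + 5
noise = rng.rand(10, 2) * 10
X = np.vstack([blob1, blob2, noise])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
# label -1 marks noise, so exclude it when counting clusters
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```

Note the O(n²) memory/time caveat from the comments still applies on a corpus the size of 20_newsgroups.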

0 Answers