
I am using scikit-learn and experimenting with KMeans. It's fast, but it requires the number of clusters as an argument. What I would like is to automatically compute the number of clusters based on the population of documents.

Hash-based near-neighbor algorithms (ssdeep) that I used before can produce similarity clusters based on distance. How can I get the cluster count automatically for k-means?

km = KMeans(init='k-means++', n_clusters=cluster_count, n_init=10)
km.fit(data)

I want to calculate that cluster_count automatically; is that possible? My test dataset is a collection of random files from 20_newsgroups, not pre-categorized into folders (everything in a single folder), so there are no labels.
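One common approach is to scan a range of candidate k values and keep the one with the best internal evaluation metric, such as the mean silhouette score. A minimal sketch (the random `data` array is a stand-in for your vectorized documents, and the k range is an assumed choice you would tune):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
data = rng.rand(200, 10)  # placeholder for your TF-IDF document vectors

best_k, best_score = None, -1.0
for k in range(2, 10):  # candidate cluster counts to try
    km = KMeans(init='k-means++', n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(data)
    score = silhouette_score(data, labels)  # higher is better, in [-1, 1]
    if score > best_score:
        best_k, best_score = k, score

cluster_count = best_k  # feed this into the final KMeans fit
```

This costs one KMeans run per candidate k, so it stays fast as long as the scan range is modest.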

Phyo Arkar Lwin
  • You can try various values of k, then pick the best clustering by some evaluation metric (see `sklearn.metrics`). – Fred Foo Dec 03 '12 at 13:23
  • From the clustering documentation, I guess `4.3.3. Affinity propagation` is what I'm looking for, but it won't be as fast as KMeans, right? Does k-means support something like the cluster-count guessing in affinity propagation? – Phyo Arkar Lwin Dec 03 '12 at 14:12
  • I tested Affinity Propagation on selected docs of the 20_newsgroups dataset (it has 19,095 documents) and it eats up all RAM (6 GB out of 8 GB, plus 5 GB of swap). So I guess it is useless for big datasets. What do you recommend? DBSCAN? – Phyo Arkar Lwin Dec 03 '12 at 14:49
  • DBSCAN might work, though it doesn't scale very well to large numbers of samples because of its O(n²) complexity. (It could have been O(n lg n) with a smarter algorithm, but we never implemented that.) – Fred Foo Dec 03 '12 at 14:55
  • I see; is there any plan to implement an O(n log n) version? – Phyo Arkar Lwin Dec 03 '12 at 19:19
  • Not that I'm aware of. If you want to try your hand at it, be my guest. – Fred Foo Dec 03 '12 at 20:32
  • I think mean shift could also work, though I'm not sure how well it scales. – Andreas Mueller Dec 04 '12 at 09:08
  • I think you may want to try BIRCH algorithm. – Naruil May 27 '13 at 04:39
  • +1 to @FredFoo. I was just trying to solve a similar problem and ended up using silhouette plots and storing the average `silhouette_score` for a hand-picked range of values for `n_clusters`. Then fed the best value into my final `KMeans` function. It only took a few minutes to code and when I verified some sets by hand it looks like it made reasonable choices for each data set. – ZSH May 25 '16 at 18:47
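As the comments suggest, DBSCAN sidesteps the problem entirely by inferring the number of clusters from density rather than taking k as input. A hedged sketch on toy data (the `eps` and `min_samples` values are assumptions you would tune for your document vectors, not recommended settings):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
# two tight blobs plus a handful of scattered noise points
blob1 = rng.randn(50, 2) * 0.1
blob2 = rng.randn(50, 2) * 0.1 + 5
noise = rng.rand(10, 2) * 10
X = np.vstack([blob1, blob2, noise])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
# label -1 marks noise, so exclude it when counting clusters
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```

Note the O(n²) memory/time caveat from the comments still applies on a corpus the size of 20_newsgroups.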

0 Answers