I am using scikit-learn and experimenting with KMeans. It's fast, but it requires the number of clusters as an argument. What I would like to do is compute the number of clusters automatically, based on the population of documents.
The hash-based near-neighbor algorithms (ssdeep) I used before can form similarity clusters based on distance alone. How can I get the number of clusters automatically for k-means?
KMeans(init='k-means++', n_clusters=cluster_count, n_init=10),
name="k-means++", data=data)
I want to calculate that cluster_count automatically. Is that possible? My test dataset is a collection of random files from 20_newsgroup. They are not pre-categorized into folders; everything sits in a single folder, so there are no labels.
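To make the goal concrete, here is a rough sketch of one heuristic that might work: fit KMeans for a range of candidate cluster counts and keep the count with the highest silhouette score, which needs no labels. The helper name estimate_cluster_count and its parameters k_min, k_max and random_state are hypothetical names chosen for illustration, and data is assumed to be a numeric feature matrix (for example TF-IDF vectors of the documents). This is only one possible criterion, not something KMeans does by itself.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def estimate_cluster_count(data, k_min=2, k_max=10, random_state=0):
    """Hypothetical helper: try each k in [k_min, k_max] and return
    (best_k, scores), where best_k has the highest silhouette score."""
    scores = {}
    for k in range(k_min, k_max + 1):
        model = KMeans(init="k-means++", n_clusters=k, n_init=10,
                       random_state=random_state)
        labels = model.fit_predict(data)
        # Silhouette is in [-1, 1]; higher means tighter, better-separated clusters.
        scores[k] = silhouette_score(data, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

The result could then be passed in as cluster_count, e.g. cluster_count, _ = estimate_cluster_count(data). The candidate range still has to be chosen by hand, k_max must stay below the number of samples, and the cost grows with every k tried, so this trades the single n_clusters guess for a search over a range.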