I have been trying to cluster a set of text documents. I have a sparse TFIDF matrix with around 10k documents (subset of a large dataset), and I try to run the scikit-learn k-means algorithm with different sizes of clusters (10,50,100). Rest all the parameters are default values.
I get a very strange behavior that no matter how many clusters I specify or even if I change the number of iterations, there would be 1 cluster in the lot which would contain most of the documents in itself and there will be many clusters which would have just 1 document in them. This is highly non-uniform behavior
Does anyone know what kind of problem am I running into?