Text Documents Clustering - Non Uniform Clusters

Question

I have been trying to cluster a set of text documents. I have a sparse TF-IDF matrix of around 10k documents (a subset of a larger dataset), and I run the scikit-learn k-means algorithm with different numbers of clusters (10, 50, 100). All other parameters are left at their default values.

I get very strange behavior: no matter how many clusters I specify, and even if I change the number of iterations, one cluster contains most of the documents, and many clusters contain just 1 document each. The cluster sizes are highly non-uniform.

Does anyone know what kind of problem I am running into?

k-means is not very robust against outliers. Single-element clusters are usually outliers. — Has QUIT--Anony-Mousse, Feb 25 '15 at 22:39
Yes, that's precisely what I have been thinking, as I suspect there are a lot of outliers. I am wondering which algorithm might be a good fit in this case. — apurva.nandan, Feb 26 '15 at 08:59

score 1 · Answer 1 · edited May 23 '17 at 10:26

Here are the things that might be going "wrong":

1. Your k-means initial cluster centers may be the same set of points in every run. I recommend passing 'random' as the init parameter of k-means: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. If that doesn't work, supply your own set of random initial cluster centers to k-means. Remember to seed your random generator with its seed() method, using the current date and time. Python's random module (https://docs.python.org/2/library/random.html) uses the current date-time as the default seed.
2. Your distance function, i.e. Euclidean distance, might be the culprit. This is less likely, but it is always good to run k-means with cosine similarity, especially when you are using it for document similarity. scikit-learn's k-means doesn't offer this at present, but you should look at this question: "Is it possible to specify your own distance function using scikit-learn K-Means Clustering?"

Together, these two changes should give you good clusters.
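
As a rough sketch of both suggestions (not a tested recipe for your data): you can get close to cosine-based k-means without a custom distance function by scaling each TF-IDF row to unit L2 norm first, because for unit vectors the squared Euclidean distance equals 2 - 2 * cosine similarity. This is only an approximation of spherical k-means, since the centroids themselves are not renormalized. The function name `cluster_unit_norm_tfidf` and the `docs` and `seed` arguments are made up for illustration. Note that `TfidfVectorizer` already applies L2 normalization by default, so the explicit `normalize` call only matters if your matrix was built some other way.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize


def cluster_unit_norm_tfidf(docs, n_clusters, seed=None):
    """Hypothetical helper: k-means on unit-length TF-IDF rows with random init."""
    # Build the sparse TF-IDF matrix (rows are documents).
    tfidf = TfidfVectorizer().fit_transform(docs)

    # Scale every row to unit L2 norm so Euclidean distance tracks cosine similarity.
    unit_rows = normalize(tfidf, norm="l2")

    # init="random" picks the starting centers at random from the data;
    # seed=None lets each run start from a different initialization.
    km = KMeans(n_clusters=n_clusters, init="random", n_init=10, random_state=seed)
    labels = km.fit_predict(unit_rows)

    # Cluster sizes make it easy to spot one huge cluster plus many singletons.
    sizes = np.bincount(labels, minlength=n_clusters)
    return labels, sizes
```

Printing `sizes` for a few values of `n_clusters` shows directly whether the one-big-cluster pattern is still there.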

score 0 · Accepted Answer · answered Sep 10 '15 at 08:19

With the help of the answers and comments above, I found that the problem was outliers and noise in the original feature space. To deal with this, we should use a dimensionality reduction method that removes the unwanted noise from the data. I tried random projections first, but they did not help with the text data: the problem remained unsolved. Then, using truncated singular value decomposition (SVD), I was able to get perfectly uniform clusters. Hence, in my opinion, truncated SVD is the way to go with textual data.
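
For anyone who wants to try this, here is a minimal sketch of the idea, not the exact code I ran: reduce the TF-IDF matrix with `TruncatedSVD`, rescale the reduced rows to unit length, and then run k-means. The function name `cluster_with_truncated_svd`, the default of 100 components, and the row normalization step are assumptions for illustration; the number of components in particular needs tuning for your data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def cluster_with_truncated_svd(docs, n_clusters, n_components=100, seed=0):
    """Hypothetical helper: TF-IDF -> truncated SVD -> unit rows -> k-means."""
    tfidf = TfidfVectorizer().fit_transform(docs)

    # Keep the component count below the vocabulary size (needed for tiny corpora).
    n_components = max(1, min(n_components, tfidf.shape[1] - 1))

    # TruncatedSVD accepts sparse input directly; Normalizer rescales each
    # reduced row to unit L2 norm (an assumed extra step, not from the answer).
    reducer = make_pipeline(
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(copy=False),
    )
    reduced = reducer.fit_transform(tfidf)

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(reduced)

    # Cluster sizes, to compare against the sizes obtained on the raw matrix.
    sizes = np.bincount(labels, minlength=n_clusters)
    return labels, sizes
```

Comparing `sizes` before and after the reduction is the quickest way to check whether it helps on a given corpus.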
