
I am clustering thousands of documents, where the vector components are tf-idf weights and similarity is measured by cosine similarity. I did a frequency analysis of words in each cluster to check the difference in top words, but I'm not sure how to evaluate the clustering numerically for this sort of documents.

I compute the internal similarity of a cluster as the average similarity of each document to the cluster's centroid, rather than averaging over all document pairs.

I compute the external similarity as the average similarity over all pairs of cluster centroids.
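A minimal sketch of both measures as described above; `X` (documents × terms, dense or scipy sparse) and `labels` (a numpy array of cluster assignments) are placeholder names, not from the question:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def internal_similarity(X, labels):
    """Average cosine similarity of each document to its cluster's
    centroid, averaged over all clusters."""
    per_cluster = []
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = np.asarray(members.mean(axis=0)).reshape(1, -1)
        per_cluster.append(cosine_similarity(members, centroid).mean())
    return float(np.mean(per_cluster))

def external_similarity(X, labels):
    """Average cosine similarity over all distinct pairs of cluster centroids."""
    centroids = np.vstack([np.asarray(X[labels == k].mean(axis=0)).ravel()
                           for k in np.unique(labels)])
    sims = cosine_similarity(centroids)
    upper = np.triu_indices_from(sims, k=1)  # count each centroid pair once
    return float(sims[upper].mean())
```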

Am I computing this right? My internal similarity values average from 0.2 (5 clusters, 2000 documents) to 0.35 (20 clusters, 2000 documents), which is probably caused by the broad range of computer-science topics in the documents. The external similarity ranges from 0.3 to 0.7. Can the results look like that? On the Internet I found various ways of measuring cluster quality, and I don't know whether to use one of those instead of my own approach. I am quite desperate.
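For cross-checking a hand-rolled measure, one standard internal criterion is the silhouette coefficient, which scikit-learn can compute directly with a cosine metric (again, `X` and `labels` are placeholders):

```python
from sklearn.metrics import silhouette_score

# Ranges from -1 (poor clustering) to +1 (dense, well-separated clusters).
score = silhouette_score(X, labels, metric='cosine')
```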

Thank you so much for your advice!

  • *k*-means uses Euclidean distance, not a similarity score. – Fred Foo May 03 '13 at 16:09
  • 1
    @larsmans still it works well with other distance metrics as well, e.G. the cosine distance. – Thomas Jungblut May 03 '13 at 16:36
  • @ThomasJungblut: or L1 distance, but only if you take medians instead of means. The reason why it works with cosine distance is probably because that's a [trivial transformation of Euclidean distance](http://stackoverflow.com/a/13662112/166749). – Fred Foo May 03 '13 at 19:11
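For reference, the identity behind that link: for L2-normalized vectors, squared Euclidean distance is a monotone function of cosine similarity,

$$\lVert x - y \rVert^2 = \lVert x \rVert^2 + \lVert y \rVert^2 - 2\,x^\top y = 2\bigl(1 - \cos(x, y)\bigr) \quad \text{when } \lVert x \rVert = \lVert y \rVert = 1,$$

so rankings by Euclidean distance and by cosine similarity agree on unit-norm vectors.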

1 Answer


Using k-means with anything but squared Euclidean distance is risky. It may stop converging, as the convergence proof relies on the mean computation and the distance-based assignment both optimizing the same criterion. K-means minimizes squared deviations, not distances!
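One common workaround, sketched here as an assumption rather than as part of this answer: L2-normalize the tf-idf rows, so that squared Euclidean distance becomes a monotone function of cosine similarity (see the identity above) and plain k-means behaves much like cosine-based clustering. scikit-learn's `TfidfVectorizer` L2-normalizes rows by default:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["first placeholder document", "second placeholder document"]  # stand-in corpus
X = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

Note the caveat: the cluster means are not themselves unit-norm, so this only approximates true spherical k-means.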

For a k-means variant that can handle arbitrary distance functions (and has guaranteed convergence), you will need to look at k-medoids.
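A minimal PAM-style k-medoids sketch over a precomputed distance matrix, e.g. from `sklearn.metrics.pairwise.cosine_distances`; an illustration under stated assumptions, not a reference implementation:

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Alternate between (1) assigning each point to its nearest medoid and
    (2) re-picking each medoid as the cluster member with the smallest total
    distance to the rest of its cluster.  D is an (n, n) distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # nearest-medoid assignment
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break  # medoids stopped moving: converged
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)  # final assignment
    return medoids, labels
```

Each of the two steps can only decrease the total within-cluster distance, which is why convergence holds for arbitrary distance functions.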

Has QUIT--Anony-Mousse
  • If I represent all documents only in the first quadrant (all components non-negative), can I use cosine similarity? For a small number of documents I debugged everything and the results came out as they should. Medoids had already occurred to me, but I don't think my solution is wrong. – Tomáš jedno May 05 '13 at 16:27
  • I don't know of a proof that k-means always converges with cosine distance. Does cosine distance minimize the sum of squares? Probably not. – Has QUIT--Anony-Mousse May 05 '13 at 16:42