
I am clustering thousands of documents, where the vector components are tf-idf weights and similarity is measured by cosine similarity. I did a frequency analysis of words in each cluster to check the difference in top words, but I'm not sure how to evaluate the clustering numerically for this sort of documents.

I compute the internal similarity of a cluster as the average similarity of each document to the cluster's centroid, rather than averaging over all document pairs.

I compute the external similarity as the average similarity over all pairs of cluster centroids.
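A minimal sketch of both measures as described above; `X` (documents × terms, dense or scipy sparse) and `labels` (a numpy array of cluster assignments) are placeholder names, not from the question:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def internal_similarity(X, labels):
    """Average cosine similarity of each document to its cluster's
    centroid, averaged over all clusters."""
    per_cluster = []
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = np.asarray(members.mean(axis=0)).reshape(1, -1)
        per_cluster.append(cosine_similarity(members, centroid).mean())
    return float(np.mean(per_cluster))

def external_similarity(X, labels):
    """Average cosine similarity over all distinct pairs of cluster centroids."""
    centroids = np.vstack([np.asarray(X[labels == k].mean(axis=0)).ravel()
                           for k in np.unique(labels)])
    sims = cosine_similarity(centroids)
    upper = np.triu_indices_from(sims, k=1)  # count each centroid pair once
    return float(sims[upper].mean())
```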

Am I computing this right? My internal similarity values average from 0.2 (5 clusters, 2000 documents) to 0.35 (20 clusters, 2000 documents), which is probably caused by the broad range of computer-science topics in the documents. The external similarity ranges from 0.3 to 0.7. Can the results look like that? On the Internet I found various ways of measuring cluster quality, and I don't know whether to use one of those instead of my own approach. I am quite desperate.
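For cross-checking a hand-rolled measure, one standard internal criterion is the silhouette coefficient, which scikit-learn can compute directly with a cosine metric (again, `X` and `labels` are placeholders):

```python
from sklearn.metrics import silhouette_score

# Ranges from -1 (poor clustering) to +1 (dense, well-separated clusters).
score = silhouette_score(X, labels, metric='cosine')
```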

Thank you so much for your advice!

  • *k*-means uses Euclidean distance, not a similarity score. – Fred Foo May 03 '13 at 16:09
  • 1
    @larsmans still it works well with other distance metrics as well, e.G. the cosine distance. – Thomas Jungblut May 03 '13 at 16:36
  • @ThomasJungblut: or L1 distance, but only if you take medians instead of means. The reason why it works with cosine distance is probably because that's a [trivial transformation of Euclidean distance](http://stackoverflow.com/a/13662112/166749). – Fred Foo May 03 '13 at 19:11
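For reference, the identity behind that link: for L2-normalized vectors, squared Euclidean distance is a monotone function of cosine similarity,

$$\lVert x - y \rVert^2 = \lVert x \rVert^2 + \lVert y \rVert^2 - 2\,x^\top y = 2\bigl(1 - \cos(x, y)\bigr) \quad \text{when } \lVert x \rVert = \lVert y \rVert = 1,$$

so rankings by Euclidean distance and by cosine similarity agree on unit-norm vectors.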

1 Answer


Using k-means with anything but squared Euclidean distance is risky. It may stop converging, as the convergence proof relies on the mean computation and the distance-based assignment both optimizing the same criterion. K-means minimizes squared deviations, not distances!
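One common workaround, sketched here as an assumption rather than as part of this answer: L2-normalize the tf-idf rows, so that squared Euclidean distance becomes a monotone function of cosine similarity (see the identity above) and plain k-means behaves much like cosine-based clustering. scikit-learn's `TfidfVectorizer` L2-normalizes rows by default:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["first placeholder document", "second placeholder document"]  # stand-in corpus
X = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

Note the caveat: the cluster means are not themselves unit-norm, so this only approximates true spherical k-means.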

For a k-means variant that can handle arbitrary distance functions (and has guaranteed convergence), you will need to look at k-medoids.
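A minimal PAM-style k-medoids sketch over a precomputed distance matrix, e.g. from `sklearn.metrics.pairwise.cosine_distances`; an illustration under stated assumptions, not a reference implementation:

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Alternate between (1) assigning each point to its nearest medoid and
    (2) re-picking each medoid as the cluster member with the smallest total
    distance to the rest of its cluster.  D is an (n, n) distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # nearest-medoid assignment
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break  # medoids stopped moving: converged
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)  # final assignment
    return medoids, labels
```

Each of the two steps can only decrease the total within-cluster distance, which is why convergence holds for arbitrary distance functions.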

Has QUIT--Anony-Mousse
  • If I represent all documents only in the first quadrant (all components non-negative), can I use cosine similarity? For a small number of documents I debugged everything and the results came out as they should. Medoids had already occurred to me, but I don't think my solution is wrong. – Tomáš jedno May 05 '13 at 16:27
  • I don't know of a proof that k-means always converges with cosine distance. Does cosine distance minimize the sum of squares? Probably not. – Has QUIT--Anony-Mousse May 05 '13 at 16:42