I have read posts on this topic, like how to cluster documents based on their similarity, but I still cannot understand how it works. In my test I have the cosine similarity measures of 10 docs. Below are some:

D1  D2   sim(D1,D2)

d1  d10 0.6823 
d1  d2  0.6377 
d1  d8  0.0307 
d1  d9  0.0294 
d1  d7  0.0284 
d1  d3  0.0234 
d1  d4  0.0199 
d1  d6  0.0110 
d1  d5  0.0030 
d10 d2  0.7232 
d10 d3  0.3898 
d10 d4  0.3054 
d10 d9  0.0256 
d10 d7  0.0227 
d10 d8  0.0226 
d10 d6  0.0110 
d10 d5  0.0060 
d2  d3  0.7850 
...
...

Can I cluster these docs based solely on the similarity measures? If I specify the number of clusters, how do I do it? If I do not specify the number of clusters, can the algorithm cluster the docs automatically, and how? Thanks in advance.

– warren

2 Answers


Clustering is one of the biggest areas of machine learning (proportionally you could compare it to, say, "integration" in mathematics, or "sorting" in programming), and there are literally hundreds of different algorithms, focused on different problem settings and requirements. Some of them require you to specify the number of clusters, some don't. Some can work with just the pairwise similarities, some require an explicit representation of the items being clustered, etc.

I suggest you start with two of the classical clustering algorithms:

  • http://en.wikipedia.org/wiki/K-means_clustering - here, you specify the number of clusters ("k") in advance; however, the objects being clustered have to be points in a vector space (there are ways to reduce the problem of document clustering to a vector-space problem: search for "term vector representation"). Since you're dealing with cosine similarity, it looks like you already have a vector space, so you can use K-means.
  • http://en.wikipedia.org/wiki/Hierarchical_clustering (in particular, "single-linkage agglomerative clustering", http://en.wikipedia.org/wiki/Single-linkage_clustering) - here, you only need the pairwise similarities: you build a tree by repeatedly finding the two most similar documents and joining them into the same cluster, until you have the desired number of clusters.
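The second option can be sketched with SciPy: convert the similarities into dissimilarities (1 − sim), run single-linkage agglomerative clustering on the matrix, then cut the tree at the desired number of clusters. The 4×4 similarity matrix below is a hypothetical toy example in the spirit of your numbers, not your actual data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical cosine-similarity matrix for 4 docs (symmetric, 1.0 on the diagonal)
sim = np.array([
    [1.00, 0.64, 0.02, 0.68],
    [0.64, 1.00, 0.79, 0.72],
    [0.02, 0.79, 1.00, 0.39],
    [0.68, 0.72, 0.39, 1.00],
])

dist = 1.0 - sim              # turn similarities into dissimilarities
np.fill_diagonal(dist, 0.0)   # squareform() requires an exactly-zero diagonal
condensed = squareform(dist)  # condensed (upper-triangle) form for linkage()

Z = linkage(condensed, method="single")          # single-linkage agglomerative
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```

With `criterion="maxclust"` you specify the number of clusters; alternatively, cut at a distance threshold (`criterion="distance"`) and let the number of clusters fall out of the data.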
– jkff

Various clustering algorithms operate on pairwise distances, and many can be adapted to work on pairwise similarities, too.

Hierarchical Agglomerative Clustering (HAC) is the prototype for this. It works on a distance or similarity matrix, and merges the most similar clusters, starting from single documents. Other algorithms include DBSCAN, OPTICS, ...

k-means is the opposite. It computes means, and distances from the mean. It doesn't work well with similarities, or with distances other than squared Euclidean, because it uses the mean: the mean minimizes least squares, not arbitrary distances. However, sometimes you have a way out. If your data is normalized to the non-negative unit sphere, then squared Euclidean distance satisfies d²(a,b) = 2 - 2*cos(a,b), and thus spherical k-means works too. Other algorithms that rely on coordinates and means include Mean-Shift and BIRCH.
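The identity above is easy to check numerically: for any two unit-length vectors, expanding ||a − b||² = ||a||² + ||b||² − 2·a·b gives 2 − 2·cos(a,b). A minimal sketch with random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
b = rng.normal(size=5)
a /= np.linalg.norm(a)  # project onto the unit sphere, so ||a|| = 1
b /= np.linalg.norm(b)  # so ||b|| = 1

cos_ab = a @ b                 # cosine similarity of unit vectors is just the dot product
d2 = np.sum((a - b) ** 2)      # squared Euclidean distance
print(np.isclose(d2, 2 - 2 * cos_ab))  # True
```

This is why running ordinary k-means on L2-normalized document vectors (spherical k-means) is equivalent to clustering by cosine similarity.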

– Has QUIT--Anony-Mousse
  • thanks so much for your answer, but I can only pick one answer as answer. Here at SF I also read many of your suggestions and links to your research page.. – warren Feb 15 '15 at 15:40