My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed Jaccard Similarity between every pair of words. In other words, I have a sparse distance matrix available with me. Can anyone point me to any clustering algorithm (and possibly its library in Python) which takes distance matrix as input ? I also do not know the number of clusters beforehand. I only want to cluster these words and obtain which words are clustered together.

- 1,224
- 1
- 13
- 34

- 851
- 2
- 9
- 13
-
take a look at http://code.google.com/p/em-python/ and "http://en.wikipedia.org/wiki/Expectation–maximization_algorithm" – Moj Apr 26 '13 at 22:21
-
there is also http://www.pymix.org/pymix/index.php?n=PyMix.Tutorial – Moj Apr 26 '13 at 22:25
-
@Moj I'm sorry...I can't seem to figure out how the information contained in the links you have mentioned are relevant here – user2115183 Apr 26 '13 at 22:26
-
(EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the. I guess this fit you goal as also don't know the number of clusters before hand. those are two libraries(or implementation ) of this algorithm . – Moj Apr 26 '13 at 22:28
-
1@Moj I was hoping something along the lines of k-means or hierarchical clustering...i know these require the number of clusters to be known beforehand.....but i hope there are ways to figure out the optimum number of clusters – user2115183 Apr 26 '13 at 22:32
3 Answers
You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately you need the number of clusters for many algorithm. DBSCAN is the only one that doesn't need the number of clusters and also uses arbitrary distance matrices. You could also try MeanShift, but that will interpret the distances as coordinates - which might also work.
There is also affinity propagation, but I haven't really seen that working well. If you want many clusters, that might be helpful, though.
disclosure: I'm a scikit-learn core dev.

- 27,470
- 8
- 62
- 74
-
6can you provide a [reproducible example](http://stackoverflow.com/help/mcve) of a scikit-learn algorithm using a distance matrix as input? – Bryan Nov 13 '14 at 14:56
-
1There is one here: http://scikit-learn.org/dev/auto_examples/cluster/plot_segmentation_toy.html – Andreas Mueller Nov 13 '14 at 20:36
-
4Is there somewhere a list of algorithms in sklearn that can take precomputed distance matrix? I found, for example that while DBSCAN does accept it, a very similar algorithm, OPTICS does not. In AgglomerativeClustering 'ward' linkage does not, while other linkages do. – subhacom Oct 04 '18 at 18:23
The scipy clustering package could be usefull (scipy.cluster). There are hierarchical clustering functions in scipy.cluster.hierarchy. Note however that those require a condensed matrix as input (the upper triangular of the distance matrix). Hopefully the documentation pages will help you along.

- 1,585
- 1
- 14
- 20