37

I have a set (2k - 4k) of small strings (3-6 characters) that I want to cluster. Since I am working with strings, previous answers to How does clustering (especially String clustering) work? informed me that Levenshtein distance is a good distance function for strings. Also, since I do not know the number of clusters in advance, hierarchical clustering is the way to go, not k-means.

Although I understand the problem in its abstract form, I do not know the easiest way to actually do it. For example, is MATLAB or R the better choice for implementing hierarchical clustering with a custom distance function (Levenshtein distance)? For both, one can easily find a Levenshtein distance implementation; the clustering part seems harder. For example, Clustering text in MATLAB calculates the distance array for all strings, but I cannot understand how to use that distance array to actually obtain the clustering. Can any of you gurus show me how to implement hierarchical clustering in either MATLAB or R with a custom distance function?

Alexandros
  • It can depend on the type of hierarchical clustering you use. [Single linkage](http://en.wikipedia.org/wiki/Single-linkage_clustering) & [complete linkage](http://en.wikipedia.org/wiki/Complete-linkage_clustering) HC can be performed w/ just a distance matrix, so once you have that by whatever method, normal clustering functions (e.g., `hclust`) should work fine; see the sketch after these comments. OTOH, *average* linkage or Ward's method require recalculating the distances at each step, so they would be more complicated to implement. – gung - Reinstate Monica Feb 02 '14 at 15:07
  • So in MATLAB [Z = linkage(Y,method)](http://www.mathworks.com/help/stats/linkage.html#inputarg_method) would work with distance matrix calculated and the complete method for example. Right? – Alexandros Feb 02 '14 at 15:22
  • I'd have to just guess that the answer is 'yes'. It has been a long time since I've used MATLAB, & I never did any clustering w/ it. – gung - Reinstate Monica Feb 02 '14 at 15:26
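To illustrate gung's point concretely, here is a minimal sketch of my own (the toy strings are hypothetical): in R, both linkages need only a precomputed distance matrix, selected via the `method` argument of `hclust`.

strs <- c("abc", "abd", "abe", "xyz", "xyw")            # toy strings (hypothetical)
d <- adist(strs)                                        # Levenshtein distance matrix
hc_single   <- hclust(as.dist(d), method = "single")    # single linkage
hc_complete <- hclust(as.dist(d), method = "complete")  # complete linkage
plot(hc_complete)                                       # inspect the dendrogram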

4 Answers

41

This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.

set.seed(1)
# Generate a vector of n random k-character strings
rstr <- function(n, k) {
  sapply(1:n, function(i) do.call(paste0, as.list(sample(letters, k, replace = TRUE))))
}

# 30 strings in 3 artificial groups, prefixed with "aa", "bb" and "cc"
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

d <- adist(str)           # Levenshtein distance matrix (utils::adist)
rownames(d) <- str
hc <- hclust(as.dist(d))  # hierarchical clustering on the distance object
plot(hc)
rect.hclust(hc, k = 3)    # outline the 3 clusters on the dendrogram
df <- data.frame(str, cluster = cutree(hc, k = 3))

In this example, we create a set of 30 random char(5) strings artificially split into 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...) and run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster IDs to the original strings.
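As a quick sanity check (my addition, not part of the original answer), you can cross-tabulate the recovered cluster IDs against the known two-letter prefixes; with this seeded toy data each cluster should line up with one prefix:

# Rows are the true "aa"/"bb"/"cc" prefixes, columns the recovered clusters
table(prefix = substr(str, 1, 2), cluster = cutree(hc, k = 3))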

jlhoward
  • So, `d <- adist(str)` computes Levenshtein for all string pairs (si->sj)? Also, do I need to include an R package for it to work? – Alexandros Feb 02 '14 at 17:17
  • `adist(...)` is in the `utils` package, which normally loads by default when you start an R session. It calculates a full distance matrix, which is why you need `as.dist(d)` to convert it to something `hclust(...)` understands as a distance object. Type `?adist` for the documentation. – jlhoward Feb 02 '14 at 17:26
  • is there a way to find out the number of clusters with this solution? – user3570187 Aug 09 '15 at 19:52
4

ELKI includes Levenshtein distance, and offers a wide choice of advanced clustering algorithms, for example OPTICS clustering.
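ELKI itself is Java, so there is no R snippet to show from it here. For readers who want to try OPTICS without leaving R, a rough sketch of my own using the third-party dbscan package (my assumption, not part of this answer) on a Levenshtein distance matrix might look like:

library(dbscan)   # assumed: CRAN package providing an OPTICS implementation

strs <- c("abc", "abd", "abe", "xyz", "xyw")  # toy strings (hypothetical)
d <- as.dist(adist(strs))                     # Levenshtein distances
res <- optics(d, minPts = 2)                  # build the reachability ordering
res <- extractDBSCAN(res, eps_cl = 2)         # DBSCAN-style cut of the reachability plot
res$cluster                                   # cluster labels; 0 marks noise points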

Text clustering support was contributed by Felix Stahlberg, as part of his work on:

Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T.
Word segmentation through cross-lingual word-to-phoneme alignment.
Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012.

We would of course appreciate additional contributions.

Erich Schubert
  • +1. I have heard of ELKI and many colleagues of mine use it. ELKI is a valid option if you want to invest the necessary time. But the R code is 10 lines long, whereas in ELKI I would have to overload many Java classes just to get an initial look at results. It is better to have a quick look at initial results, even when based on a non-optimal algorithm, than to waste 15-30 days learning a framework just to see that my approach is not correct. So, R for now is fine. Later, ELKI might be a better solution. – Alexandros Feb 03 '14 at 14:55
  • ELKI needs a scripting API; I've been considering adding Groovy, but haven't had time to do so yet. With R, I haven't been too happy because of performance. Anything that is not a matrix is slow, and matrix operations scale with `O(n^2)` or worse. If I want to try something out quickly, I usually find scipy to be the best scripting option, and more often than not it is surprisingly fast due to Cython code. – Erich Schubert Feb 03 '14 at 15:43
3

While the answer depends to a degree on the meaning of the strings, in general your problem is solved by the sequence analysis family of techniques. More specifically, Optimal Matching Analysis (OMA).

Most often OMA is carried out in three steps. First, you define your sequences; from your description, each letter is a separate "state", the building block of a sequence. Second, you employ one of several algorithms to calculate the distances between all sequences in your dataset, obtaining the distance matrix. Finally, you feed that distance matrix into a clustering algorithm, such as hierarchical clustering or Partitioning Around Medoids (PAM), which seems to be gaining popularity due to the additional information it gives on the quality of the clusters. The latter guides you in choosing the number of clusters, one of the several subjective steps in sequence analysis.
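To make these steps concrete, here is a minimal sketch of my own (not the answerer's code), treating each character as a state and assuming the TraMineR package introduced below:

library(TraMineR)   # assumed installed from CRAN

strings <- c("abc", "abd", "bcd", "abcf")   # toy strings (hypothetical)

# Step 1: define sequences -- one character per state, right-padded with NA
m <- t(sapply(strings, function(s) {
  ch <- strsplit(s, "")[[1]]
  length(ch) <- max(nchar(strings))         # pad shorter strings with NA
  ch
}))
seqs <- seqdef(m, right = "DEL")            # drop trailing missing states

# Step 2: distance matrix via Optimal Matching with constant substitution cost
cost <- seqsubm(seqs, method = "CONSTANT", cval = 2)
dist.om1 <- seqdist(seqs, method = "OM", indel = 1, sm = cost)

# Step 3: feed dist.om1 into a clustering algorithm (see agnes() below)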

In R the most convenient package, with a great number of functions, is TraMineR. Its user guide is very accessible, and the developers are more or less active on SO as well.

You are likely to find that clustering is not the most difficult part, apart from deciding on the number of clusters. The TraMineR guide shows that the syntax is very straightforward, and the results are easy to interpret based on visual sequence graphs. Here is an example from the user guide:

library(cluster)   # agnes() comes from the cluster package
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")

dist.om1 is the distance matrix obtained by OMA; the clustering tree is contained in the clusterward1 object, with which you can do whatever you want: plotting, cutting into groups, recoding as variables, etc. The diss = TRUE option indicates that the data object is the dissimilarity (or distance) matrix. Easy, eh? The most difficult choice (not syntactically, but methodologically) is choosing the right distance algorithm, suitable for your particular application. Once you have that, and can justify the choice, the rest is quite easy. Good luck!
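One small caveat (my note, not from the guide): agnes() returns a full hierarchy rather than flat memberships, so to get group labels you still cut the tree, for example:

# Continuing the guide's snippet: cut the agnes tree into a chosen number
# of groups (k = 4 here is an arbitrary illustration)
groups <- cutree(as.hclust(clusterward1), k = 4)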

Maxim.K
  • I have looked at sequence pattern matching, but defining an alphabet seems like overkill, because distance(abc,abd) = distance(abc,abf). So why define a dictionary, since we only check for inequality: d!=c is the same as f!=c. Still, +1 for your effort. – Alexandros Feb 02 '14 at 17:12
  • The alphabet is defined automatically, sampling all possible states. You can alter it, of course. The regular OM algorithm would assign exactly the same distance to (abc,abd) and (abc,abf): the distance is based on one substitution in both cases, and its cost is identical, assuming you did not assign a differential cost to these specific letters. Of course, if your problem is as simple as the other solution suggests, that's perfectly fine. You could also use PAM instead of HC. – Maxim.K Feb 02 '14 at 20:36
2

If you would like a clear explanation of how to use partitional clustering (which will surely be faster) to solve your problem, check this paper: Effective Spell Checking Methods Using Clustering Algorithms. https://www.researchgate.net/publication/255965260_Effective_Spell_Checking_Methods_Using_Clustering_Algorithms?ev=prf_pub

The authors explain how to cluster a dictionary using a modified (PAM-like) version of iK-Means.
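The paper's modified iK-Means is not, to my knowledge, available in R. As a hedged stand-in (standard PAM from the cluster package, not the paper's method), partitional clustering can be run directly on a precomputed Levenshtein distance matrix, though unlike iK-Means you must supply k yourself:

library(cluster)   # pam() lives in the cluster package

strs <- c("abc", "abd", "abe", "xyz", "xyw")  # toy strings (hypothetical)
d <- as.dist(adist(strs))                     # Levenshtein distance matrix
fit <- pam(d, k = 2, diss = TRUE)             # Partitioning Around Medoids
fit$clustering                                # cluster label for each string
fit$medoids                                   # medoid of each cluster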

Best of Luck!

  • So, does R have this modified (PAM-like) version of iK-Means implemented? Does this method automatically extract the number of clusters? If not, I have no way to actually implement a clustering algorithm myself, even if it really is the state of the art. Also, the slow part must be the distance matrix and not the clustering. – Alexandros Feb 03 '14 at 11:37
  • ... still +1 for your contribution – Alexandros Feb 03 '14 at 14:57
  • Yes, it automatically gets the number of clusters. To my knowledge there is no R version of it; however, it shouldn't be hard to implement (it's just about 80 lines in Matlab). – TheVoiceInMyHead Feb 24 '14 at 09:46