
I've tried the solution for hierarchical clustering based on Levenshtein distance described in this answer: https://stackoverflow.com/a/21513338/14485257

The code is as follows:

set.seed(1)
rstr <- function(n, k) {   # vector of n random char(k) strings
  sapply(1:n, function(i) do.call(paste0, as.list(sample(letters, k, replace = TRUE))))
}

str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein distance matrix
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))        # hierarchical clustering on the distance matrix
plot(hc)
rect.hclust(hc, k = 3)          # highlight the 3 clusters on the dendrogram
df <- data.frame(str, cutree(hc, k = 3))

In this example, 30 strings are being clustered and the solution works perfectly fine. But when I apply the same code to anything more than 15,000 strings, I get an error like this:

Error: cannot allocate vector of size 74.5 Gb

The total number of strings that I need to cluster is actually around 500,000. So, is there any way around this issue?
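For context, a back-of-the-envelope calculation shows why this fails at scale (plain base R arithmetic; the 8 bytes per entry assumes double storage, which is what as.dist() uses):

n <- 500000
n^2 * 8 / 2^30               # full n x n matrix from adist(): ~1863 GiB
n * (n - 1) / 2 * 8 / 2^30   # even the lower triangle (dist object): ~931 GiB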

EnigmAI
  • At 15,000 strings, your distance matrix will have 225,000,000 entries. At 500,000 strings, the distance matrix will have 250,000,000,000 entries. It is not surprising that you run out of memory. – G5W Nov 20 '20 at 13:44
  • That's true. Is there any other efficient way to deal with it then? – EnigmAI Nov 22 '20 at 14:58
  • You could try something like the hybrid solution proposed at [hclust() in R on large datasets](https://stackoverflow.com/q/40989003/4752675) – G5W Nov 22 '20 at 15:08
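Following up on the hybrid suggestion in that last comment, here is a minimal sketch of one way it could look for string data. Everything specific here is an illustrative assumption: the sample size of 2,000, the chunk size of 10,000, the choice of k = 3, and the use of cluster medoids (the sample string with the smallest total distance to the rest of its cluster) as assignment targets. It approximates clustering the full data rather than reproducing it exactly:

# Hybrid sketch: cluster a small sample fully, then assign the rest.
# Assumes `str` holds all ~500,000 strings; sample size, chunk size,
# and k are placeholders, not tuned values.
set.seed(1)
k <- 3
samp <- sample(str, 2000)            # sample small enough for a full adist()
d_samp <- adist(samp)
hc <- hclust(as.dist(d_samp))
grp <- cutree(hc, k = k)

# Medoid of each sample cluster: the string with the smallest total
# distance to the other members of its cluster.
medoid_idx <- sapply(1:k, function(g) {
  idx <- which(grp == g)
  idx[which.min(rowSums(d_samp[idx, idx, drop = FALSE]))]
})
medoid_str <- samp[medoid_idx]

# Assign every string to its nearest medoid, in chunks, so that only a
# chunk_size x k distance matrix exists at any one time.
chunk_id <- ceiling(seq_along(str) / 10000)
labels <- unlist(lapply(split(str, chunk_id), function(chunk) {
  apply(adist(chunk, medoid_str), 1, which.min)
}), use.names = FALSE)
df <- data.frame(str, cluster = labels)

The key point is that adist() is only ever called on the small sample or on chunk-by-k slices, so memory stays bounded regardless of the total number of strings.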

0 Answers