
I've tried the solution for hierarchical clustering based on Levenshtein distance described in this answer: https://stackoverflow.com/a/21513338/14485257

The code is as follows:

set.seed(1)
rstr <- function(n, k) {   # vector of n random char(k) strings
  sapply(1:n, function(i) do.call(paste0, as.list(sample(letters, k, replace = TRUE))))
}

str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein distance matrix
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))        # hierarchical clustering on the distance matrix
plot(hc)
rect.hclust(hc, k = 3)          # highlight the 3 clusters on the dendrogram
df <- data.frame(str, cutree(hc, k = 3))

In this example, 30 strings are being clustered and the solution works perfectly fine. But when I apply the same code to anything more than 15,000 strings, I get an error like this:

Error: cannot allocate vector of size 74.5 Gb

The total number of strings that I need to cluster is actually around 500,000. So, is there any way around this issue?
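For context, a back-of-the-envelope calculation shows why this fails at scale (plain base R arithmetic; the 8 bytes per entry assumes double storage, which is what as.dist() uses):

n <- 500000
n^2 * 8 / 2^30               # full n x n matrix from adist(): ~1863 GiB
n * (n - 1) / 2 * 8 / 2^30   # even the lower triangle (dist object): ~931 GiB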

EnigmAI
  • At 15,000 strings, your distance matrix will have 225,000,000 entries. At 500,000 strings, the distance matrix will have 250,000,000,000 entries. It is not surprising that you run out of memory. – G5W Nov 20 '20 at 13:44
  • That's true. Is there any other efficient way to deal with it then? – EnigmAI Nov 22 '20 at 14:58
  • You could try something like the hybrid solution proposed at [hclust() in R on large datasets](https://stackoverflow.com/q/40989003/4752675) – G5W Nov 22 '20 at 15:08
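Following up on the hybrid suggestion in that last comment, here is a minimal sketch of one way it could look for string data. Everything specific here is an illustrative assumption: the sample size of 2,000, the chunk size of 10,000, the choice of k = 3, and the use of cluster medoids (the sample string with the smallest total distance to the rest of its cluster) as assignment targets. It approximates clustering the full data rather than reproducing it exactly:

# Hybrid sketch: cluster a small sample fully, then assign the rest.
# Assumes `str` holds all ~500,000 strings; sample size, chunk size,
# and k are placeholders, not tuned values.
set.seed(1)
k <- 3
samp <- sample(str, 2000)            # sample small enough for a full adist()
d_samp <- adist(samp)
hc <- hclust(as.dist(d_samp))
grp <- cutree(hc, k = k)

# Medoid of each sample cluster: the string with the smallest total
# distance to the other members of its cluster.
medoid_idx <- sapply(1:k, function(g) {
  idx <- which(grp == g)
  idx[which.min(rowSums(d_samp[idx, idx, drop = FALSE]))]
})
medoid_str <- samp[medoid_idx]

# Assign every string to its nearest medoid, in chunks, so that only a
# chunk_size x k distance matrix exists at any one time.
chunk_id <- ceiling(seq_along(str) / 10000)
labels <- unlist(lapply(split(str, chunk_id), function(chunk) {
  apply(adist(chunk, medoid_str), 1, which.min)
}), use.names = FALSE)
df <- data.frame(str, cluster = labels)

The key point is that adist() is only ever called on the small sample or on chunk-by-k slices, so memory stays bounded regardless of the total number of strings.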

0 Answers