
I'm trying to cluster a list of strings based on their similarity.

For example, if my strings are: ABCD45, ABCD67, ABCD921, XYZ12, XYZ94

Then after clustering,

ABCD45, ABCD67 & ABCD921 will be assigned to Cluster 1

XYZ12 & XYZ94 will be assigned to Cluster 2

I've tried out this solution for Hierarchical clustering based on Levenshtein distance as mentioned in this answer: https://stackoverflow.com/a/21513338/14485257

The code is as follows:

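set.seed(1)
# Return a vector of n random lowercase strings, each k characters long
rstr <- function(n, k) {
  sapply(seq_len(n), function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""))
}

# 30 test strings: 10 each with the prefixes "aa", "bb" and "cc"
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein distance between every pair of strings
d <- adist(str)
rownames(d) <- str
# Hierarchical clustering on the distance matrix, cut into 3 clusters
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc, k = 3)
df <- data.frame(str, cluster = cutree(hc, k = 3))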

In this example there are only 30 strings, and k=3 was picked arbitrarily but happens to work well. When I apply the same approach to my own data, which has 6000 strings, and vary k, the distribution of strings across clusters changes drastically.
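To make the effect of varying k concrete, here is a minimal sketch that cuts one tree at several k values and tabulates the cluster sizes for each. The `strings` vector is a hypothetical toy input built from the example strings in this question, and the range `2:4` is an arbitrary choice for illustration.

```r
# Hypothetical toy input; replace with the real character vector.
strings <- c("ABCD45", "ABCD67", "ABCD921", "XYZ12", "XYZ94")

d <- adist(strings)          # pairwise Levenshtein distances
rownames(d) <- strings
hc <- hclust(as.dist(d))     # default linkage is "complete"

# cutree() accepts a vector of k values and returns one column per k.
memberships <- cutree(hc, k = 2:4)

# Cluster sizes for each k, as a named list of tables.
sizes <- lapply(colnames(memberships), function(k) table(memberships[, k]))
names(sizes) <- paste0("k=", colnames(memberships))

print(memberships)
print(sizes)
```

Comparing the columns of `memberships` side by side shows which strings move between clusters as k changes.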

The output of the code above on the example strings is exactly what I need: it assigns a cluster number to each string. My questions are:

  1. How do I select the most appropriate value of k?
  2. If I have a much larger list of, say, 50,000 strings, should I still follow the same approach, or use a different clustering technique altogether?

My end goal is a data frame in which each string is assigned to a particular cluster number.
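For a sense of scale on the second question, a rough back-of-the-envelope sketch of the memory needed just to hold the pairwise distances. It assumes each distance is stored as an 8-byte double, and `dist_memory_gb` is a hypothetical helper name; the working memory that `adist()` and `hclust()` need on top of this is not counted.

```r
# Approximate memory (in GB) to store pairwise distances for n strings,
# assuming 8 bytes per distance (an assumption, not a measured figure).
dist_memory_gb <- function(n) {
  full  <- n^2 * 8               # full n x n matrix, the shape adist() returns
  lower <- n * (n - 1) / 2 * 8   # lower triangle only, the shape of a "dist" object
  c(full_gb = full / 1e9, dist_gb = lower / 1e9)
}

print(dist_memory_gb(6000))
print(dist_memory_gb(50000))
```

Under that assumption, 6000 strings need well under 1 GB, while 50,000 strings need about 20 GB for the full matrix and about 10 GB for the lower triangle, so the same approach gets expensive quickly as the list grows.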

EnigmAI
  • Hi @EnigmAI, How did you end up approaching this? – Brian Petro Jan 12 '21 at 13:07
  • @BrianPetro For my first question, on finding the best k value, you can read about the Silhouette Method and some other techniques described here, which I found useful: [link](https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/) – EnigmAI Jan 13 '21 at 16:36
  • @BrianPetro I couldn't really resolve my second question, about handling a very large number of strings. The only option I could find is to use a cloud computing service such as Azure Databricks or DigitalOcean, where you can pick a high-spec machine to do the heavy computation, though you'll have to pay for it. – EnigmAI Jan 13 '21 at 16:38
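As an illustration of the silhouette idea mentioned in the comments, here is a minimal sketch that scores several candidate k values by average silhouette width and keeps the highest. It assumes the `cluster` package is available; the toy `strings` vector and the candidate range `2:4` are hypothetical choices for illustration, not taken from the linked article.

```r
library(cluster)  # provides silhouette(); assumed to be installed

# Hypothetical toy input; replace with the real character vector.
strings <- c("ABCD45", "ABCD67", "ABCD921", "XYZ12", "XYZ94")

d  <- as.dist(adist(strings))   # Levenshtein distances as a "dist" object
hc <- hclust(d)

# Average silhouette width for each candidate k (k must be between 2 and n - 1).
candidate_k <- 2:4
avg_width <- sapply(candidate_k, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_width) <- candidate_k

# Pick the k with the largest average silhouette width.
best_k <- candidate_k[which.max(avg_width)]

print(avg_width)
print(best_k)

# Final data frame: one row per string with its cluster number.
result <- data.frame(string = strings, cluster = cutree(hc, k = best_k))
print(result)
```

On real data the candidate range would need to be wider, and each evaluation reuses the same distance object, so the cost of choosing k is dominated by computing the distances once.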
