I'm trying to cluster a list of strings based on their similarity.
For example, if my strings are ABCD45, ABCD67, ABCD921, XYZ12 and XYZ94, then after clustering:
- ABCD45, ABCD67 and ABCD921 should be assigned to Cluster 1
- XYZ12 and XYZ94 should be assigned to Cluster 2
I've tried hierarchical clustering based on Levenshtein distance, as described in this answer: https://stackoverflow.com/a/21513338/14485257
The code is as follows:
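set.seed(1)
# Return a vector of n random lowercase strings, each k characters long
rstr <- function(n, k) {
  sapply(1:n, function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""))
}
# 30 test strings: 10 each with the prefixes "aa", "bb" and "cc"
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Pairwise Levenshtein distance matrix
d <- adist(str)
rownames(d) <- str
# Hierarchical clustering (complete linkage is the hclust default)
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc, k = 3)
# One row per string with its assigned cluster number
df <- data.frame(str, cluster = cutree(hc, k = 3))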
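Applied to the five example strings from the top of the question, the same approach can be sketched as follows. This is only an illustration: k = 2 is an assumption based on the two visible prefixes, and the names `strings` and `result` are made up for this sketch.

```r
# Example strings from the top of the question
strings <- c("ABCD45", "ABCD67", "ABCD921", "XYZ12", "XYZ94")

# Pairwise Levenshtein distances (adist() comes from the utils package)
d <- adist(strings)
rownames(d) <- strings

# Complete-linkage hierarchical clustering (the hclust() default)
hc <- hclust(as.dist(d))

# k = 2 is assumed here because the example has two visible prefixes
result <- data.frame(string = strings,
                     cluster = cutree(hc, k = 2),
                     row.names = NULL)
print(result)
```

The `result` data frame has one row per string and a `cluster` column, which is the shape of output I am after.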
In this example there are only 30 strings, built from three known prefixes, so k = 3 is an obvious choice and works well. But when I use the same approach on my own data, which has 6000 strings, the cluster assignments change drastically as I vary k.
The output of the code above on the example strings is exactly what I need: it assigns a cluster number to each string. My questions are:
- How do I select the most appropriate value of k?
- If I have a much larger list of, say, 50,000 strings, should I still follow the same approach, or use a different clustering technique altogether?
My final objective is to get a data frame in which each string is assigned a cluster number.
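To make the first question concrete, here is a minimal sketch of one common heuristic: pick the k that maximises the average silhouette width. It is not taken from the linked answer. It assumes the `cluster` package is installed, the candidate range 2:4 is arbitrary, and the names `candidate_k`, `avg_sil` and `best_k` are made up for illustration.

```r
library(cluster)  # assumed available; provides silhouette()

strings <- c("ABCD45", "ABCD67", "ABCD921", "XYZ12", "XYZ94")
d  <- as.dist(adist(strings))   # Levenshtein distances as a dist object
hc <- hclust(d)                 # complete linkage by default

# Average silhouette width for each candidate number of clusters
candidate_k <- 2:4              # arbitrary range chosen for this sketch
avg_sil <- sapply(candidate_k, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})

# Keep the k with the highest average silhouette width
best_k <- candidate_k[which.max(avg_sil)]

result <- data.frame(string = strings,
                     cluster = cutree(hc, k = best_k),
                     row.names = NULL)
print(result)
```

Whether this heuristic gives sensible clusters on 6000 real strings is something I can't tell from the toy example, and it still needs the full pairwise distance matrix, so it does not address the 50,000-string case.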