I have a list of filenames in R and need to cluster the similar ones. I used stringdistmatrix to compute the distance between every pair of strings, but I am having a hard time dividing them into clusters. So far my approach is simple: two nested loops traverse the entire distance matrix. For each distance that is less than 5, I put its column name in a list and set the distance to NA for the rest of the row. Written as loops, a single run takes almost 15 minutes. Is there a faster way to do this? This is what I have so far:
library(stringdist)

# Pairwise Levenshtein distances between the unique names, as a full named matrix
kpm <- stringdistmatrix(unique(dat1$name), useNames = "strings", method = "lv")
kpm <- as.matrix(kpm)
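vec <- list()  # clusters found so far
v <- 1         # index of the next cluster
i <- 1
while (i <= nrow(kpm)) {
  lt <- NULL   # names collected for row i
  l <- 1
  j <- 1
  while (j <= ncol(kpm)) {
    # NA marks a name that was already placed in an earlier cluster
    if (!is.na(kpm[i, j]) && kpm[i, j] <= 8) {
      lt[l] <- colnames(kpm)[j]
      l <- l + 1
      kpm[, j] <- NA  # blank column j so this name is not collected again
    }
    j <- j + 1
  }
  # keep only groups with more than one name
  if (length(lt) > 1) {
    vec[[v]] <- lt
    v <- v + 1
  }
  i <- i + 1
}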
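
One possible way to avoid the explicit loops, given only as a sketch and not as a drop-in replacement: run single-linkage hierarchical clustering on the same distances with hclust and cut the tree at the threshold with cutree. This is a different technique from the greedy row scan above. With single linkage, two names end up together whenever a chain of pairwise distances at or below the threshold connects them, so the groups can come out larger than the loop's. The sample names, the variable names, and the threshold of 5 are assumptions for illustration. The sketch uses base R's adist for the Levenshtein distances so that it runs without extra packages; the dist object returned by stringdistmatrix could be passed to hclust in the same way.

```r
# Hypothetical sample data standing in for unique(dat1$name)
names_vec <- c("report_2019.csv", "report_2020.csv", "report_2021.csv",
               "holiday_photo_01.jpg", "holiday_photo_02.jpg",
               "notes.txt")

# Pairwise Levenshtein distances; adist() ships with base R (utils)
d <- adist(names_vec)
dimnames(d) <- list(names_vec, names_vec)

threshold <- 5  # assumed cutoff, taken from the description above

# Single linkage merges two groups at the smallest distance between any
# pair of their members, so cutting at `threshold` joins every chain of
# names linked by distances <= threshold
hc <- hclust(as.dist(d), method = "single")
groups <- cutree(hc, h = threshold)

# One character vector per cluster; drop clusters with a single name
clusters <- split(names_vec, groups)
clusters <- Filter(function(x) length(x) > 1, clusters)
```

Here clusters is a list of character vectors, comparable to vec in the loop version. If the chaining behaviour is not wanted, method = "complete" in hclust instead requires every pair inside a cluster to be within the threshold.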