
I have a list of filenames in R and need to cluster the similar ones. I used `stringdistmatrix` to compute the distance between every pair of strings, but I am having a hard time dividing the strings into clusters. My approach so far is simple: two nested loops traverse the entire string distance matrix. For each distance less than 5, I put its column name in a list and set the distance to NA for the rest of the row. Run this way, a single pass takes almost 15 minutes. Is there a faster way to do this? This is what I have so far:

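library(stringdist)

# Pairwise Levenshtein distances, expanded to a full square matrix
kpm <- stringdistmatrix(unique(dat1$name), useNames = "strings", method = "lv")
kpm <- as.matrix(kpm)

vec <- list()  # one character vector of filenames per cluster
v <- 1
i <- 1
while (i <= nrow(kpm)) {
  lt <- NULL   # names collected for row i; reset on every row
  l <- 1
  j <- 1
  while (j <= ncol(kpm)) {
    # Index by row and column: kpm[j][1] only ever read the first column.
    # Skip cells already set to NA, because if (NA) is an error.
    if (!is.na(kpm[i, j]) && kpm[i, j] <= 8) {
      lt[l] <- colnames(kpm)[j]
      l <- l + 1
      # Assumed intent: blank column j so later rows cannot claim this name
      kpm[, j] <- NA
    }
    j <- j + 1
  }
  if (length(lt) > 1) {
    vec[[v]] <- lt
    v <- v + 1
  }
  i <- i + 1
}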
RIP71DE
  • how many file names do you have? – friep Jul 04 '17 at 07:12
  • Welcome to Stack Overflow! You have explained the issue well; it would have been perfect with a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Meanwhile, can you try this: `kpm <- stringdistmatrix(unique(dat1$name),useNames="strings",method="lv"); plot(hclust(kpm,method = "ward"))`. For a test output I used dummy text: `kpm <- stringdistmatrix(unique(rownames(mtcars)),useNames="strings",method="lv"); plot(hclust(kpm,method = "ward"))`. – Silence Dogood Jul 04 '17 at 07:23
  • @friep there are somewhere around 3000 to 4000 unique filenames in each table and almost 400 such tables. – RIP71DE Jul 05 '17 at 08:06
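
Following the `hclust` suggestion in the comments, here is a minimal sketch of how the tree could be cut into groups without an explicit double loop. It is not the asker's method: the filenames are hypothetical stand-ins for `unique(dat1$name)`, base R's `adist()` is swapped in for `stringdistmatrix()` so the snippet runs without extra packages, and complete linkage with a cut height of 8 is an assumption chosen to mirror the `<= 8` test in the question's code.

```r
# Hypothetical filenames standing in for unique(dat1$name)
files <- c("report_2017_01.csv", "report_2017_02.csv", "report_2017_03.csv",
           "holiday_photo_a.jpg", "holiday_photo_b.jpg", "notes.txt")

# Pairwise Levenshtein distances. adist() is base R (utils); the dist object
# returned by stringdistmatrix(files, method = "lv") could be used instead.
d <- as.dist(adist(files))

# Complete linkage: a cluster formed below height h has all of its
# pairwise distances at or below h.
hc <- hclust(d, method = "complete")

# Assumed threshold of 8, taken from the comparison in the question's loop
cluster_id <- cutree(hc, h = 8)

# Group the names by cluster id and keep clusters with more than one member
clusters <- split(files, cluster_id)
clusters <- Filter(function(x) length(x) > 1, clusters)
print(clusters)
```

With 3000 to 4000 names per table this still needs all pairwise distances (about 8 million for 4000 names), so the distance computation itself remains the main cost; only the grouping step is replaced by `hclust()` and `cutree()`.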

0 Answers