I would like to speed up the following code. Could someone please make some suggestions?

library(dplyr)
library(fuzzywuzzyR)

set.seed(42)
rm(list = ls())
options(scipen = 999)

init = FuzzMatcher$new()

data <- data.frame(string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"))
data$string <- as.character(data$string)

distance_function <- function(string_1, string_2) {
    init$Token_set_ratio(string1 = string_1, string2 = string_2)
}

# All unordered pairs of row indices (one column per pair)
combinations <- combn(nrow(data), 2)

# Result matrix, labelled by the input strings; the diagonal stays NA
distance_matrix <- matrix(NA, nrow = nrow(data), ncol = nrow(data), dimnames = list(data$string, data$string))

for (i in seq_len(ncol(combinations))) {

    distance <- distance_function(data[combinations[1, i], 1], data[combinations[2, i], 1])

    # Fill both triangles so the matrix stays symmetric
    distance_matrix[combinations[1, i], combinations[2, i]] <- distance
    distance_matrix[combinations[2, i], combinations[1, i]] <- distance

}

distance_matrix

By the way, I tried proxy::dist and various other approaches without success. I also do not think that the string distance function works as expected, but that's another story.

Ultimately, I want to use the distance matrix to perform some clustering to group similar strings (independent of word order).

cs0815
  • Can you please profile the code to see where the time goes? If it is `distance_function` that takes most of the time, then it will be difficult to speed up with the current setup – akrun May 02 '19 at 15:31
  • Not sure how to do this, but perhaps the loop could be changed to an apply? Sorry, still an R virgin – cs0815 May 02 '19 at 15:33
  • You can use `profvis`. Check [here](https://rstudio.github.io/profvis/) – akrun May 02 '19 at 15:33
  • Thanks, good to know. I will have a go, but I am not sure if this only works with RStudio, which I do not use ... – cs0815 May 02 '19 at 15:36
  • If `proxy::dist` was still too slow for you, then you might have to implement your own function in C or C++. I recently showed an [example with multi-threading](https://stackoverflow.com/a/55666677/5793905) using a geographical distance, but you could adjust it to support strings and output the complete matrix. Also check [this example](http://gallery.rcpp.org/articles/parallel-distance-matrix/). – Alexis May 04 '19 at 13:53
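
The "change the loop to an apply" idea from the comments could look roughly like the sketch below. It is only an illustration under stated assumptions: `distance_function` here is a stand-in built on base R's `utils::adist` (plain Levenshtein distance), not `Token_set_ratio`, so swap in your own comparison. As noted above, if the comparison itself dominates the runtime, removing the loop will not help much; wrapping each variant in `system.time()` is a quick way to check without RStudio.

```r
strings <- c("hello world", "hello vorld", "hello world 1",
             "hello world", "hello world hello world")

# Stand-in comparison (assumption): replace with your own scalar distance.
distance_function <- function(a, b) drop(utils::adist(a, b))

n <- length(strings)
pairs <- combn(n, 2)  # one column per unordered pair of indices

# Evaluate the distance once per pair, without an explicit for loop.
d <- mapply(distance_function,
            strings[pairs[1, ]], strings[pairs[2, ]],
            USE.NAMES = FALSE)

# Fill both triangles in one step each via matrix indexing.
distance_matrix <- matrix(NA_real_, n, n, dimnames = list(strings, strings))
distance_matrix[t(pairs)] <- d           # (row, col) pairs: upper triangle
distance_matrix[t(pairs[2:1, ])] <- d    # swapped pairs: lower triangle

distance_matrix
```

The diagonal is left as `NA`, matching the original loop; set it to 0 if the clustering step needs a complete matrix.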

1 Answer

If you want a matrix, you can use the stringdist package. From what I could tell, the package you were using calculates Levenshtein distance, so I used method = "lv" (you could try other methods too). Let me know if you have issues, or if you would prefer a format other than a matrix. You may also want to consider a measure other than raw Levenshtein distance, because it is not normalized by length: a change of two characters in a four-letter word counts the same as a change of two characters in a 20-word sentence. Good luck!

library(dplyr)
library(stringdist)

set.seed(42)
rm(list = ls())
options(scipen = 999)

data <- data.frame(string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"))
data$string <- as.character(data$string)

dist_mat <- stringdist::stringdistmatrix(data$string, data$string, method = "lv")

rownames(dist_mat) <- data$string
colnames(dist_mat) <- data$string

dist_mat
                        hello world hello vorld hello world 1 hello world hello world hello world
hello world                       0           1             2           0                      12
hello vorld                       1           0             3           1                      13
hello world 1                     2           3             0           2                      11
hello world                       0           1             2           0                      12
hello world hello world          12          13            11          12                       0
Andrew
  • Thanks, I am aware of stringdist, but lv is not good enough (see Token_set_ratio), hence my implementation from scratch ... – cs0815 May 02 '19 at 15:52
  • Ahh I see, I didn't catch that when I first looked at it. Apologies! When I ran your code it gave me a matrix of `NA`'s, so I couldn't check the output. I am having trouble understanding `Token_set_ratio`. I still can't get it to work for me, but it may be something simple that I am missing. Could you explain a little about why you prefer it over other distance measures (i.e., why it best suits your data)? It looks like it extracts alphanumeric characters and makes a ratio of intersecting vs leftover characters. So, would "green" and "genre" be a perfect match? – Andrew May 02 '19 at 16:28
  • You mentioned the apply family in your question. If it works the same way that regular string distance works, you could always use `sapply` and it will output a matrix for this. For example (using `stringdist`): `sapply(data$string, stringdist, data$string, method = "lv")` – Andrew May 02 '19 at 16:31
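
For the stated goal of grouping strings regardless of word order, one possible workaround is to normalise each string before measuring distance and then cluster on the resulting matrix. The sketch below is an assumption-laden illustration, not a reimplementation of `Token_set_ratio`: `normalise_tokens` is a hypothetical helper that lower-cases, splits on non-alphanumeric characters, and keeps sorted unique tokens; the distance is plain Levenshtein from base R's `utils::adist`; and the cut height of 0.5 is arbitrary.

```r
strings <- c("hello world", "hello vorld", "hello world 1",
             "hello world", "hello world hello world")

# Hypothetical helper: reduce each string to its sorted, unique,
# lower-case tokens so word order and repetition no longer matter.
normalise_tokens <- function(x) {
  tokens <- strsplit(tolower(x), "[^[:alnum:]]+")
  vapply(tokens,
         function(tk) paste(sort(unique(tk[nzchar(tk)])), collapse = " "),
         character(1))
}

norm <- normalise_tokens(strings)

# adist() compares all pairs at once and returns a full matrix.
dist_mat <- utils::adist(norm)
dimnames(dist_mat) <- list(strings, strings)

# Hierarchical clustering on the distance matrix; the cut height is
# an arbitrary choice for this toy data.
hc <- hclust(as.dist(dist_mat), method = "complete")
groups <- cutree(hc, h = 0.5)

dist_mat
groups
```

With this normalisation, "hello world hello world" collapses to the same token string as "hello world", so the two land in the same group, while "hello vorld" stays separate. Whether that behaviour suits real data depends on how much repetition and typos should count.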