create distance matrix for strings

Question

I would like to speed up the following code. Could someone please be so kind as to make some suggestions?

library(dplyr)
library(fuzzywuzzyR)

set.seed(42)
rm(list = ls())
options(scipen = 999)

init = FuzzMatcher$new()

data <- data.frame(string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"))
data$string <- as.character(data$string)

distance_function <- function(string_1, string_2) {
    init$Token_set_ratio(string1 = string_1, string2 = string_2)
}

combinations <- combn(nrow(data), 2)
distances <- matrix(, nrow = 1, ncol = ncol(combinations))

distance_matrix <- matrix(NA, nrow = nrow(data), ncol = nrow(data), dimnames = list(data$string, data$string))

for (i in 1:ncol(combinations)) {

    distance <- distance_function(data[combinations[1, i], 1], data[combinations[2, i], 1])

    #print(data[combinations[1, i], 1])
    #print(data[combinations[2, i], 1])
    #print(distance)

    distance_matrix[combinations[1, i], combinations[2, i]] <- distance
    distance_matrix[combinations[2, i], combinations[1, i]] <- distance

}

distance_matrix

By the way, I tried to use proxy::dist and various other approaches without success. I also do not think that the string distance function works as expected, but that's another story.

Ultimately, I want to use the distance matrix to perform some clustering to group similar strings (independent of word order).

Can you please profile it to see where the time is spent? If it is the `distance_function` that takes most of the time, then it would be difficult to speed up with the current setup — akrun, May 02 '19 at 15:31
Not sure how to do this, but perhaps the loop could be changed to an apply? Sorry, I am still new to R — cs0815, May 02 '19 at 15:33
you can use `profvis`. check [here](https://rstudio.github.io/profvis/) — akrun, May 02 '19 at 15:33
Thanks, good to know. I will have a go, but I am not sure if this only works with RStudio, which I do not use ... — cs0815, May 02 '19 at 15:36
If `proxy::dist` was still too slow for you, then you might have to implement your own function in C or C++. I recently showed an [example with multi-threading](https://stackoverflow.com/a/55666677/5793905) using a geographical distance, but you could adjust it to support strings and output the complete matrix. Also check [this example](http://gallery.rcpp.org/articles/parallel-distance-matrix/). — Alexis, May 04 '19 at 13:53
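For readers without RStudio, the profiling suggestion in the comments above can also be followed with base R alone. The sketch below is only an illustration under stated assumptions: `utils::adist` is a stand-in for `Token_set_ratio` (so the block runs without fuzzywuzzyR and its Python backend), and `pairwise_loop` is a hypothetical wrapper around the question's loop.

```r
strings <- c("hello world", "hello vorld", "hello world 1",
             "hello world", "hello world hello world")

# Stand-in scorer (assumption): base R edit distance instead of Token_set_ratio.
distance_function <- function(a, b) as.numeric(utils::adist(a, b))

# Hypothetical wrapper around the loop from the question.
# As in the question, the diagonal is left as NA.
pairwise_loop <- function(x) {
  n <- length(x)
  out <- matrix(NA_real_, n, n, dimnames = list(x, x))
  pairs <- combn(n, 2)
  for (k in seq_len(ncol(pairs))) {
    i <- pairs[1, k]
    j <- pairs[2, k]
    d <- distance_function(x[i], x[j])
    out[i, j] <- d
    out[j, i] <- d
  }
  out
}

result <- pairwise_loop(strings)

# Coarse wall-clock timing of repeated runs.
elapsed <- system.time(for (r in seq_len(500)) pairwise_loop(strings))
print(elapsed)

# Sampling profiler from base R (utils); no RStudio required.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
for (r in seq_len(500)) pairwise_loop(strings)
Rprof(NULL)

# summaryRprof() can fail when no samples were recorded, so guard the call.
profile <- tryCatch(summaryRprof(prof_file)$by.self, error = function(e) NULL)
print(head(profile))
```

If most of the self time lands inside the scoring call rather than in the matrix bookkeeping, replacing the loop with an apply-style call is unlikely to help much.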

score 2 · Answer 1 · answered May 02 '19 at 15:51

If you want a matrix, you can use the stringdist package. From what I could tell, the package you were using calculates Levenshtein distance, so I used method = "lv" (you could try other methods too). Let me know if you have issues, or if a format other than a matrix would be preferred. Also, you may want to consider a method other than raw Levenshtein distance, because it is not scaled by string length (e.g., a change of two characters in a four-letter word counts the same as a change of two characters in a 20-word sentence). Good luck!!!

library(dplyr)
library(stringdist)

set.seed(42)
rm(list = ls())
options(scipen = 999)

data <- data.frame(string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"))
data$string <- as.character(data$string)

dist_mat <- stringdist::stringdistmatrix(data$string, data$string, method = "lv")

rownames(dist_mat) <- data$string
colnames(dist_mat) <- data$string

dist_mat
                        hello world hello vorld hello world 1 hello world hello world hello world
hello world                       0           1             2           0                      12
hello vorld                       1           0             3           1                      13
hello world 1                     2           3             0           2                      11
hello world                       0           1             2           0                      12
hello world hello world          12          13            11          12                       0
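The length caveat mentioned in this answer can be illustrated by scaling each distance by the length of the longer string in the pair. This is a minimal sketch, not part of the original answer: it uses base R's `utils::adist` (which computes a generalized Levenshtein distance) instead of stringdist so that it runs without extra packages, and it assumes no string is empty.

```r
strings <- c("hello world", "hello vorld", "hello world 1",
             "hello world", "hello world hello world")

# Raw edit distances between all pairs (utils ships with base R).
raw <- utils::adist(strings)

# Scale each distance by the length of the longer string in the pair,
# so values lie between 0 (identical) and 1 (every position differs).
# Assumption: no empty strings, otherwise this divides by zero.
lens <- nchar(strings)
normalized <- raw / outer(lens, lens, pmax)
dimnames(normalized) <- list(strings, strings)

print(round(normalized, 2))
```

With this scaling, a one-character change in an 11-character string (1/11) no longer looks the same as a one-character change in a much longer string.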

Thanks, I am aware of stringdist, but lv is not good enough; see Token_set_ratio. Hence my implementation from scratch ... — cs0815, May 02 '19 at 15:52
Ahh I see, I didn't catch that when I first looked at it. Apologies! When I ran your code it gave me a matrix of `NA`s, so I couldn't check the output. I am having trouble understanding `Token_set_ratio`. I still can't get it to work for me, but it may be something simple that I am missing. Could you explain a little about why you prefer it over other distance measures (i.e., why it best suits your data)? It looks like it extracts alphanumeric characters and makes a ratio of intersecting vs. leftover characters. So, would "green" and "genre" be a perfect match? — Andrew, May 02 '19 at 16:28
You mentioned the apply family in your question. If it works the same way that regular string distance works, you could always use `sapply` and it will output a matrix for this. For example (using `stringdist`): `sapply(data$string, stringdist, data$string, method = "lv")` — Andrew, May 02 '19 at 16:31
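The `sapply` pattern from the comment above also works with a custom, word-order-insensitive distance. The sketch below is an assumption-laden illustration, not a reimplementation of `Token_set_ratio`: `normalize_tokens` and `pair_distance` are hypothetical helpers that reduce each string to its sorted unique tokens and then compare those keys with base R's `utils::adist`. The resulting matrix is then passed to `hclust`, which is one possible way to do the clustering the question mentions.

```r
strings <- c("hello world", "hello vorld", "hello world 1",
             "hello world", "hello world hello world")

# Hypothetical helper: lower-case, split on whitespace, drop repeated
# tokens, sort them, and join them back into a single key.
normalize_tokens <- function(s) {
  tokens <- strsplit(tolower(trimws(s)), "\\s+")[[1]]
  paste(sort(unique(tokens)), collapse = " ")
}

keys <- vapply(strings, normalize_tokens, character(1), USE.NAMES = FALSE)

# Hypothetical helper: edit distance from one key to every key.
pair_distance <- function(a, all_keys) as.numeric(utils::adist(a, all_keys))

# sapply() binds the per-string vectors into a square matrix.
dist_mat <- sapply(keys, pair_distance, all_keys = keys)
dimnames(dist_mat) <- list(strings, strings)
print(dist_mat)

# Hierarchical clustering on the matrix; the cut height is an arbitrary
# choice for this toy data.
tree <- hclust(as.dist(dist_mat), method = "average")
clusters <- cutree(tree, h = 1.5)
print(unname(clusters))
```

Because repeated tokens are dropped, "hello world" and "hello world hello world" end up at distance 0 here. Note that `utils::adist(keys)` alone would return the same matrix in a single call; the `sapply` form is shown only because it generalizes to scorers that are not vectorized over pairs.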
