I have a character vector and want to create a matrix of distance metrics for each pair of vector values (using the stringdist package). Currently, I have an implementation with nested for loops:

library(stringdist)

strings <- c("Hello", "Helo", "Hole", "Apple", "Ape", "New", "Old", "System", "Systemic")
m <- matrix(nrow = length(strings), ncol = length(strings))
colnames(m) <- strings
rownames(m) <- strings

for (i in 1:nrow(m)) {
  for (j in 1:ncol(m)) {
    m[i,j] <- stringdist::stringdist(tolower(rownames(m)[i]), tolower(colnames(m)[j]), method = "lv")
  }
}

which results in the following matrix:

> m
         Hello Helo Hole Apple Ape New Old System Systemic
Hello        0    1    3     4   5   4   4      6        7
Helo         1    0    2     4   4   3   3      6        7
Hole         3    2    0     3   3   4   2      5        7
Apple        4    4    3     0   2   5   4      5        7
Ape          5    4    3     2   0   3   3      5        7
New          4    3    4     5   3   0   3      5        7
Old          4    3    2     4   3   3   0      6        8
System       6    6    5     5   5   5   6      0        2
Systemic     7    7    7     7   7   7   8      2        0

However, if I have, for instance, a vector of length 1000 with many non-unique values, this matrix is quite large (let's say 800 rows by 800 columns) and the loops are very slow. I would like to optimize the performance, e.g. by using apply functions, but I don't know how to translate the above code into apply syntax. Can anyone help?

Daniel
  • `apply` is also looping, and not necessarily faster than a for loop. See also http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar – Joris Meys Sep 03 '14 at 12:04
  • Code optimization questions should be asked on CodeReview rather than StackOverflow http://codereview.stackexchange.com/ – Hack-R Jun 26 '16 at 16:08

4 Answers


When you have nested loops, it's always worth checking whether outer() can do the job for you. outer() is a vectorized alternative to nested loops: it applies a vectorized function to every possible combination of the elements of its first two arguments. As stringdist() works on vectors, you can simply do:

library(stringdist)
strings <- c("Hello", "Helo", "Hole", "Apple", "Ape", "New", 
             "Old", "System", "Systemic")

outer(strings, strings,
      function(i, j) {
        # method = "lv" matches the question; stringdist() defaults to "osa"
        stringdist(tolower(i), tolower(j), method = "lv")
      })

gives you the desired result.
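
To see why outer() needs a vectorized function, here is a minimal base-R sketch. The toy function len_diff and the counter calls are illustrative names, not part of stringdist. The point is that outer() calls the function a single time with both arguments expanded to all combinations, rather than once per pair.

```r
x <- c("a", "bb", "ccc")

calls <- 0
len_diff <- function(i, j) {
  calls <<- calls + 1           # count how often outer() invokes the function
  abs(nchar(i) - nchar(j))      # works element-wise on whole vectors
}

res <- outer(x, x, len_diff)    # 3 x 3 matrix of pairwise results
dimnames(res) <- list(x, x)     # outer() only keeps names of named inputs

calls                           # number of function calls made by outer()
res
```

A function that only handles one pair at a time would have to be wrapped (for example with Vectorize()) before being passed to outer().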

Joris Meys

The Biostrings package from Bioconductor has a stringDist function that can do this for you:

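# Install Biostrings via BiocManager (the current replacement for biocLite())
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Biostrings")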

library(Biostrings)

stringDist(c("Hello", "Helo", "Hole", "Apple", "Ape", "New", "Old", "System", "Systemic"), upper=TRUE)

##   1 2 3 4 5 6 7 8 9
## 1   1 3 4 5 4 4 6 7
## 2 1   2 4 4 3 3 6 7
## 3 3 2   3 3 4 3 5 7
## 4 4 4 3   2 5 4 5 7
## 5 5 4 3 2   3 3 5 7
## 6 4 3 4 5 3   3 5 7
## 7 4 3 3 4 3 3   6 8
## 8 6 6 5 5 5 5 6   2
## 9 7 7 7 7 7 7 8 2
hrbrmstr
  • Thanks a lot, and shame on me: the `stringdist` package has such a function as well: `stringdistmatrix` – Daniel Sep 03 '14 at 12:09
  • You could/should post that as an answer and de-accept mine and accept that (points!). I've got "bioconductor" on the mind these days (building something similar for infosec) and it's probably overkill for the answer. – hrbrmstr Sep 03 '14 at 12:15

Thanks to the hint from @hrbrmstr, I found out that the stringdist package itself provides a function called stringdistmatrix, which does what I was asking for.

The function call is simply: stringdistmatrix(strings, strings)
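
Since the question mentions many non-unique values, one possible refinement is to compute distances only for the distinct lowercased strings and then index back into that smaller matrix. This is a sketch under assumptions: the toy input and the variable names u, idx and full are made up for illustration, and method = "lv" is passed explicitly to match the question.

```r
library(stringdist)

# Toy input with duplicates (hypothetical data, not from the question)
strings <- c("Hello", "Helo", "hello", "Hole", "Helo")

# Normalise case and keep each distinct value once
u <- unique(tolower(strings))

# Pairwise Levenshtein distances between the distinct values only
m <- stringdistmatrix(u, u, method = "lv")
dimnames(m) <- list(u, u)

# Optional: expand back to one row/column per original element
idx <- match(tolower(strings), u)
full <- m[idx, idx, drop = FALSE]
dimnames(full) <- list(strings, strings)
```

With many repeated values, m stays small and full is built by indexing alone, so no distance is computed twice for the same pair of distinct strings.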

Daniel

Here's an easy one to start with: the matrix is symmetric, so there's no need to calculate the entries below the diagonal, because m[j, i] equals m[i, j]. The diagonal elements are all zero, so there's no need to compute those either.

Like this:

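n <- nrow(m)
diag(m) <- 0                   # identical strings have distance 0
for (i in seq_len(n - 1)) {    # the last row has no entries above the diagonal
  for (j in (i + 1):n) {
    m[i, j] <- stringdist::stringdist(tolower(rownames(m)[i]), tolower(colnames(m)[j]), method = "lv")
    m[j, i] <- m[i, j]         # mirror into the lower triangle
  }
}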
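
The symmetry argument can also be left to stringdist itself. As far as I know, calling stringdistmatrix() with a single vector returns a dist object holding only the lower triangle, which as.matrix() expands into a full symmetric matrix. A hedged sketch, assuming a reasonably recent stringdist version:

```r
library(stringdist)

strings <- c("Hello", "Helo", "Hole", "Apple", "Ape", "New",
             "Old", "System", "Systemic")

# One-argument form: each unordered pair is computed once (assumed to
# return an object of class "dist")
d <- stringdistmatrix(tolower(strings), method = "lv")

# Expand to a full matrix; the diagonal is filled with zeros
m <- as.matrix(d)
dimnames(m) <- list(strings, strings)
```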
duffymo