4

I have a data.table dt of 9k rows (see sample below). I need to compare each rname of dt to each cname of a reference data.table dt.ref. By comparing, I mean computing the Levenshtein ratio.

Then, I take the maximum and get my output (see below).

dt

nid | rname  | maxr
n1  | apple  |  0.5
n2  | pear   |  0.8
n3  | banana |  0.7
n4  | kiwi   |  0.6
... (9k)

dt.ref

cid | cname
c1  | apple
c2  | maple
c3  | peer
c4  | dear
c5  | bonobo
c6  | kiwis
... (75k)

output

nid | rname  | maxr | maxLr | cid
n1  | apple  |  0.5 | 1     | c1
n2  | pear   |  0.8 | 0.75  | c3
n2  | pear   |  0.8 | 0.75  | c4
n3  | banana |  0.7 | 0.33  | c5
n4  | kiwi   |  0.6 | 0.8   | c6
...

To compute this output, I use the stringdistmatrix function inside a function coded this way (see Computing the Levenshtein ratio of each element of a data.table with each value of a reference table and merge with maximum ratio):

f1 <- function(x, y) {
  require(stringdist)
  require(matrixStats)
  dis  <- stringdistmatrix(x, y, method = "lv")
  mat <- sapply(nchar(y), function(i) pmax(i, nchar(x)))
  r <- 1 - dis / mat
  w <- apply(r, 1, function(x) which(x==max(x)))
  m <- rowMaxs(r)
  list(m = m, w = w)
}

r <- f1(dt[[2]], dt.ref[[2]])
dt[, maxLr := r$m ]
dtnew <- dt[rep(1:.N, lengths(r$w)),]
dtnew[, cid := dt.ref[unlist(r$w), 1]]

However, with a 9k x 75k matrix, I have a memory issue that makes the R session abort. Beyond splitting up the 9k table, would it be a way to:

  1. code differently using only data.table and not a matrix?
  2. sort and split the reference table to compute the Levenshtein ratio only on a subset of the 75k strings?
user2590177
  • 167
  • 1
  • 11

0 Answers0