I have a data.table dt of 9k rows (see sample below). I need to compare each rname of dt to each cname of a reference data.table dt.ref. By "comparing" I mean computing the Levenshtein ratio, i.e. 1 minus the Levenshtein distance divided by the length of the longer string. I then take the maximum ratio per row, keeping ties, to get my output (see below).
dt
nid | rname | maxr
n1 | apple | 0.5
n2 | pear | 0.8
n3 | banana | 0.7
n4 | kiwi | 0.6
... (9k)
dt.ref
cid | cname
c1 | apple
c2 | maple
c3 | peer
c4 | dear
c5 | bonobo
c6 | kiwis
... (75k)
output
nid | rname | maxr | maxLr | cid
n1 | apple | 0.5 | 1 | c1
n2 | pear | 0.8 | 0.75 | c3
n2 | pear | 0.8 | 0.75 | c4
n3 | banana | 0.7 | 0.33 | c5
n4 | kiwi | 0.6 | 0.8 | c6
...
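For concreteness, the maxLr of 0.75 for n2 can be reproduced by hand. Base R's adist() also returns the Levenshtein distance, so it serves as a stand-in for stringdistmatrix in this small check:

```r
# Levenshtein ratio = 1 - distance / max(string lengths)
# base R's adist() computes the Levenshtein (edit) distance
d <- adist("pear", "peer")[1, 1]                       # 1 substitution
ratio <- 1 - d / max(nchar("pear"), nchar("peer"))
ratio                                                  # 0.75, the n2/c3 row above
```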
To compute this output, I use the stringdistmatrix function from the stringdist package, wrapped in a function coded this way (see Computing the Levenshtein ratio of each element of a data.table with each value of a reference table and merge with maximum ratio):
f1 <- function(x, y) {
  require(stringdist)
  require(matrixStats)
  dis <- stringdistmatrix(x, y, method = "lv")             # Levenshtein distances
  mat <- sapply(nchar(y), function(i) pmax(i, nchar(x)))   # pairwise max string lengths
  r <- 1 - dis / mat                                       # Levenshtein ratios
  w <- apply(r, 1, function(ri) which(ri == max(ri)))      # column(s) achieving each row max
  m <- rowMaxs(r)                                          # row maxima
  list(m = m, w = w)
}
r <- f1(dt[[2]], dt.ref[[2]])            # rname vs cname
dt[, maxLr := r$m]
dtnew <- dt[rep(1:.N, lengths(r$w)), ]   # duplicate rows where ties occur
dtnew[, cid := dt.ref[unlist(r$w), 1]]
However, with a 9k x 75k matrix, I run into a memory issue that makes the R session abort. Beyond splitting up the 9k table, would there be a way to:
- code this differently, using only data.table and not a matrix?
- sort and split the reference table so the Levenshtein ratio is computed only on a subset of the 75k strings?
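One direction I have considered (a sketch only, not a definitive answer): keep the full result but never materialise the 9k x 75k matrix, by processing the reference vector in blocks and carrying a running maximum per row. The sketch below uses base R's adist() as a stand-in for stringdistmatrix, and the function name max_ratio_chunked and the chunk parameter are my own inventions; it returns the same m/w structure as f1 above, so the merge code would stay unchanged.

```r
# Chunked Levenshtein-ratio maximum: compare x (9k) against y (75k) block by
# block, so only an length(x)-by-chunk matrix exists in memory at any time.
max_ratio_chunked <- function(x, y, chunk = 5000L) {
  nx <- length(x)
  best <- rep(-Inf, nx)             # running max ratio per element of x
  which_best <- vector("list", nx)  # indices into y achieving the max (ties kept)
  for (start in seq(1L, length(y), by = chunk)) {
    idx <- start:min(start + chunk - 1L, length(y))
    d <- adist(x, y[idx])                      # Levenshtein distances for this block
    m <- outer(nchar(x), nchar(y[idx]), pmax)  # pairwise max string lengths
    r <- 1 - d / m                             # Levenshtein ratios
    for (i in seq_len(nx)) {
      rowmax <- max(r[i, ])
      if (rowmax > best[i]) {                  # new best: replace ties
        best[i] <- rowmax
        which_best[[i]] <- idx[which(r[i, ] == rowmax)]
      } else if (rowmax == best[i]) {          # equal best: accumulate ties
        which_best[[i]] <- c(which_best[[i]], idx[which(r[i, ] == rowmax)])
      }
    }
  }
  list(m = best, w = which_best)
}
```

The chunk size trades peak memory against the number of distance calls; with chunk = 5000 the working matrix is 9k x 5k instead of 9k x 75k. Sorting/bucketing y by nchar could further shrink each block, since strings with very different lengths cannot reach a high ratio.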