binary search with data table, include second, third, fourth,.. nearest value

Question

consider I have two vectors. One is a reference vector/list that includes all values of interest and one samplevector that could contain any possible value. Now I want to find matches of my sample inside the reference list with a certain tolerance which is not fixed and depentent on the comparing values inside the vectors:

matches: abs(((referencelist - sample[i])/sample[i])*10^6)) < 0.5

rounding both vectors is no option!

for example consider:

referencelist <- read.table(header=TRUE, text="value  name
154.00312  A
154.07685  B
154.21452  C
154.49545  D
156.77310  E
156.83991  F
159.02992  G
159.65553  H
159.93843  I")

sample <- c(154.00315, 159.02991, 154.07688, 156.77312)

so I get the result:

  name value      reference
1    A   154.00315  154.00312
2    G   159.02991  159.02992
3    B   154.07688  154.07685
4    E   156.77312  156.77310

I got the beautiful and very fast binary search solution here: Matching two very very large vectors with tolerance (fast! but working space sparing)

library(data.table)

dt <- as.data.table(referencelist)
setattr(dt, "sorted", "value")

tol <- 0.5
dt2 <- dt[J(sample), .(.I, ref = value, name), roll = "nearest", by = .EACHI]
dt2[, diff := abs(ref - value) / value * 1e6]
dt2[diff <= tol]

#       value I      ref name       diff
# 1: 154.0032 1 154.0031    A 0.19480121
# 2: 159.0299 7 159.0299    G 0.06288125
# 3: 154.0769 2 154.0769    B 0.19470799
# 4: 156.7731 5 156.7731    E 0.12757289

But here additional questions came up where I really dont know how to proceed and would be happy for any further help:

First: what happens here when I have e.g. a F = 154.0033 in the reference list. Then my sample value 154.0032 is not only in the tolerance range of A like above but also in the range of F. Nevertheless the data.table approach only gives me the nearest value. How can I get the second, third.. etc nearest values but still use data.table, as this is the only solution fast enough for my large datasets. If it would be possible to get the rownumber it might be possible to just go +-x from the row of the nearest match along the reference list to obtain all possible values in tolerance range because its ordered. So is there something like a

dt2 <- dt[J(sample), .(.I, ref = value, name), roll = "nearest" +-x , by = .EACHI]

?

Second: when using

dt2 <- dt[J(sample), .(.I, ref = value, name), roll = "nearest", by = .EACHI]

sample is just a vector. But what if sample is a dataframe with a column value that is the key for matching with the referencelist but has other 100+ columns that should stay in the resulting data.table. I really tried to understand the data.tables syntax but do not managed to do that. Might someone help me here, too?

here as an example:

sample <- data.frame(value=c(154.00315, 159.02991, 154.07688, 156.77312),replicate(100,sample(0:1,4,rep=TRUE)))

e.g. getting something like this:

      value   I   ref     name  diff      X1 ... X100
# 1: 154.0032 1 154.0031    A  0.19480121  X       X
# 2: 159.0299 7 159.0299    G  0.06288125  X       X
# 3: 154.0769 2 154.0769    B  0.19470799  X       X
# 4: 156.7731 5 156.7731    E  0.12757289  X       X

Please, ask only *one* question at a time. For your second question (which should go into a separate post), please, provide a suitable [mcve]. Thank you. — Uwe, Oct 28 '17 at 09:58
I added a example data frame. But if this is something more complicated to solve I will ask a new question. Many thanks for your answer! — JmO, Oct 28 '17 at 12:43

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

Your match condition

abs(((referencelist - sample[i])/sample[i])*10^6)) < 0.5

can be re-written as

sample[i] * (1 - eps) < referencelist < sample[i] * (1 + eps)

with eps = 0.5E-6.

Using this, we can use a non-equi-join to find all matches in referencelist for each sample:

library(data.table)
options(digits = 10)
eps <- 0.5E-6 # tol * 1E6
setDT(referencelist)[.(value = sample, 
                       lower = sample * (1 - eps), 
                       upper = sample * (1 + eps)), 
                     on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]

which reproduces the expected result:

   name     value reference
1:    A 154.00315 154.00312
2:    G 159.02991 159.02992
3:    B 154.07688 154.07685
4:    E 156.77312 156.77310

Let's say, we have a modified referencelist2 with F = 154.00320 then this will be caught too:

setDT(referencelist2)[.(value = sample, 
                       lower = sample * (1 - eps), 
                       upper = sample * (1 + eps)), 
                     on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]

   name     value reference
1:    A 154.00315 154.00312
2:    F 154.00315 154.00320
3:    G 159.02991 159.02992
4:    B 154.07688 154.07685
5:    E 156.77312 156.77310

By the way, the expression
.(value = sample, lower = sample * (1 - eps), upper = sample * (1 + eps))
is creating a data.table on the fly. We could have another datat.table to join with, here.

Edit: `sample` given as data frame

To illustrate the second part of his question, the OP has given sample as a data frame with 100+ columns:

sample <- data.frame(value = c(154.00315, 159.02991, 154.07688, 156.77312),
                     replicate(100L, sample(0:1, 4L, rep = TRUE)))

This can be handled by a non-equi join as well but requires requires a few modifications:

eps <- 0.5E-6 # tol * 1E6
setDT(referencelist2)[
  # modify referencelist2 by copying ref column to preserve its value
  , reference := ref][
    # modify sample by appending lower and upper for non-equi join
    setDT(sample)[, c("lower", "upper") := .(value * (1 - eps), value * (1 + eps))], 
    on = .(ref > lower, ref < upper)][
      # remove upper and lower columns (renamed to ref and ref.1 during the join)
      , -c("ref", "ref.1")]

   name reference     value X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 ...
1:    A 154.00312 154.00315  0  1  1  0  0  0  0  1  0   0   0   1   0   0   0   0   0
2:    F 154.00320 154.00315  0  1  1  0  0  0  0  1  0   0   0   1   0   0   0   0   0
3:    G 159.02992 159.02991  0  0  0  1  0  1  1  0  0   0   1   1   1   0   1   1   0
4:    B 154.07685 154.07688  0  1  1  1  1  1  1  1  1   1   0   1   1   1   0   0   0
5:    E 156.77310 156.77312  1  0  1  0  1  0  0  1  1   1   0   1   0   1   0   1   0

thanks a lot will try that as soon as I am back on my desk – JmO Oct 28 '17 at 10:48 — JmO, Oct 28 '17 at 10:48

binary search with data table, include second, third, fourth,.. nearest value

1 Answers1

Edit: sample given as data frame

Edit: `sample` given as data frame