consider I have two vectors. One is a reference vector/list that includes all values of interest and one samplevector that could contain any possible value. Now I want to find matches of my sample inside the reference list with a certain tolerance which is not fixed and depentent on the comparing values inside the vectors:
matches: abs(((referencelist - sample[i])/sample[i])*10^6)) < 0.5
rounding both vectors is no option!
for example consider:
referencelist <- read.table(header=TRUE, text="value name
154.00312 A
154.07685 B
154.21452 C
154.49545 D
156.77310 E
156.83991 F
159.02992 G
159.65553 H
159.93843 I")
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)
so I get the result:
name value reference
1 A 154.00315 154.00312
2 G 159.02991 159.02992
3 B 154.07688 154.07685
4 E 156.77312 156.77310
I got the beautiful and very fast binary search solution here: Matching two very very large vectors with tolerance (fast! but working space sparing)
library(data.table)
dt <- as.data.table(referencelist)
setattr(dt, "sorted", "value")
tol <- 0.5
dt2 <- dt[J(sample), .(.I, ref = value, name), roll = "nearest", by = .EACHI]
dt2[, diff := abs(ref - value) / value * 1e6]
dt2[diff <= tol]
# value I ref name diff
# 1: 154.0032 1 154.0031 A 0.19480121
# 2: 159.0299 7 159.0299 G 0.06288125
# 3: 154.0769 2 154.0769 B 0.19470799
# 4: 156.7731 5 156.7731 E 0.12757289
But here additional questions came up where I really dont know how to proceed and would be happy for any further help:
First: what happens here when I have e.g. a F = 154.0033 in the reference list. Then my sample value 154.0032 is not only in the tolerance range of A like above but also in the range of F. Nevertheless the data.table approach only gives me the nearest value. How can I get the second, third.. etc nearest values but still use data.table, as this is the only solution fast enough for my large datasets. If it would be possible to get the rownumber it might be possible to just go +-x from the row of the nearest match along the reference list to obtain all possible values in tolerance range because its ordered. So is there something like a
dt2 <- dt[J(sample), .(.I, ref = value, name), roll = "nearest" +-x , by = .EACHI]
?
Second: when using
dt2 <- dt[J(sample), .(.I, ref = value, name), roll = "nearest", by = .EACHI]
sample is just a vector. But what if sample is a dataframe with a column value
that is the key for matching with the referencelist but has other 100+ columns that should stay in the resulting data.table. I really tried to understand the data.tables syntax but do not managed to do that. Might someone help me here, too?
here as an example:
sample <- data.frame(value=c(154.00315, 159.02991, 154.07688, 156.77312),replicate(100,sample(0:1,4,rep=TRUE)))
e.g. getting something like this:
value I ref name diff X1 ... X100
# 1: 154.0032 1 154.0031 A 0.19480121 X X
# 2: 159.0299 7 159.0299 G 0.06288125 X X
# 3: 154.0769 2 154.0769 B 0.19470799 X X
# 4: 156.7731 5 156.7731 E 0.12757289 X X