Matching two very very large vectors with tolerance (fast! but working space sparing)

Question

consider I have two vectors. One is a reference vector/list that includes all values of interest and one samplevector that could contain any possible value. Now I want to find matches of my sample inside the reference list with a certain tolerance which is not fixed and depentent on the comparing values inside the vectors:

matches: abs(((referencelist - sample[i])/sample[i])*10^6)) < 0.5

rounding both vectors is no option!

for example consider:

referencelist <- read.table(header=TRUE, text="value  name
154.00312  A
154.07685  B
154.21452  C
154.49545  D
156.77310  E
156.83991  F
159.02992  G
159.65553  H
159.93843  I")

sample <- c(154.00315, 159.02991, 154.07688, 156.77312)

so I get the result:

    name value      reference
1    A   154.00315  154.00312
2    G   159.02991  159.02992
3    B   154.07688  154.07685
4    E   156.77312  156.77310

what I can do is using e.g. the outer function like

myDist <- outer(referencelist, sample, FUN=function(x, y) abs(((x - y)/y)*10^6))
matches <- which(myDist < 0.5, arr.ind=TRUE)
data.frame(name = referencelist$name[matches[, 1]], value=sample[matches[, 2]])

or I could use a for() loop.

But my special problem is, that the reference vector has around 1*10^12 entries and my sample vector around 1*10^7. so by using outer() I easily destroy all working space limits and by using a for() or chained for() loop this will took days/weeks to finish.

Has anybody an idea of how to do this fast in R, still precise but working on a computer consuming max. 64 GB RAM?

Thanks for any help!

Best whishes

Is the reference vector sorted? If so, you can do a bisection method to find the closest number. If not -- sort it! Also, how is it even stored? It is too big to fit in your RAM. Is it a single file or does it span multiple files? — John Coleman, Oct 26 '17 at 15:01
In the example you give, all values appear to be in a narrow enough range (154, 160). Can we simplify the problem by calculating `(x - y) / 154` instead of `(x - y) / y`? — Aurèle, Oct 26 '17 at 16:07
@john Coleman: Yes it is sorted ascending. Can you give an example of what is an bisection method? Have not heared it before, I am sry. The reference vector fits into my RAM and then I have around 64 GB free space. — JmO, Oct 26 '17 at 16:16
@aurele I do not know 100% what you mean, but if I understood you correctly then no we cannot, because the range in both vectors goes from around 100 up to 3000. — JmO, Oct 26 '17 at 16:18
I meant to say "binary search". Look at the base R function `findInterval()` — John Coleman, Oct 26 '17 at 16:56

score 4 · Answer 1 · answered Oct 26 '17 at 17:32

4

Using data.table (and copy-pasting from @eddi's binary search (also called bisection, cf @John Coleman's comment)):

library(data.table)

dt <- as.data.table(referencelist)
setattr(dt, "sorted", "value")

tol <- 0.5
dt2 <- dt[J(sample), .(.I, ref = value, name), roll = "nearest", by = .EACHI]
dt2[, diff := abs(ref - value) / value * 1e6]
dt2[diff <= tol]

#       value I      ref name       diff
# 1: 154.0032 1 154.0031    A 0.19480121
# 2: 159.0299 7 159.0299    G 0.06288125
# 3: 154.0769 2 154.0769    B 0.19470799
# 4: 156.7731 5 156.7731    E 0.12757289

I haven't benchmarked memory usage nor execution time, but data.table has the reputation of being very good at both. If it doesn't work for you, say so and maybe I'll try to benchmark things.

Note: my use of data.table is quite naive.

And there's a solution using findInterval just below: https://stackoverflow.com/a/29552922/6197649, but I'd expect it to perform worse (again: would require benchmarks).

answered Oct 26 '17 at 17:32

Aurèle

12,545
1
31
49

thank you a lot, this sounds very nice. I will test it as soon as possible, but I am sry to say that this might take some hours/days because of the size of the vectors. Back soon and thanks again! – JmO Oct 26 '17 at 18:01
just a short question beforehand, does sample needs to be a single vector or could it be a data frame or matrix with the possibility to select the column sample with the values in it? – JmO Oct 26 '17 at 18:10
1

@JmO Linear search is `O(n)`. For `n = 10^12` that is prohibitive, especially when you want to do it `10^7` times. On the other hand, binary search is `O(log_2(n))`. The base 2 logarithm of 10^12 is about 40. Note that 40 steps per search rather than 10^12 is a speed-up by a factor of 25 billion. It won't take days or even hours. Just a couple of minutes should suffice. – John Coleman Oct 26 '17 at 21:18
thanks a lot both of you!! I will implement it today. For the answer above: As I have not really worked with data.table does this give me only the nearest match or every match in the tolerance range?: Consider I have an F = 154.0033 in the reference list. Now for 156.0032 in the sample there are A and F from the reference list in tolerancerange and I want to have both in the resulting list in seperate rows. Is that possible with this approach or does it only give me the nearest one? – JmO Oct 27 '17 at 05:21
a solution like roll="nearest" +- 2 should be enough. Is there any way to implement this? would be very very nice. Thanks a lot in advance for every help here. – JmO Oct 27 '17 at 07:15

score 3 · Accepted Answer · answered Oct 28 '17 at 10:44

Your match condition

abs(((referencelist - sample[i])/sample[i])*10^6)) < 0.5

can be re-written as

sample[i] * (1 - eps) < referencelist < sample[i] * (1 + eps)

with eps = 0.5E-6.

Using this, we can use a non-equi-join to find all matches (not only the nearest!) in referencelist for each sample:

library(data.table)
options(digits = 10)
eps <- 0.5E-6 # tol * 1E6
setDT(referencelist)[.(value = sample, 
                       lower = sample * (1 - eps), 
                       upper = sample * (1 + eps)), 
                     on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]

which reproduces the expected result:

   name     value reference
1:    A 154.00315 154.00312
2:    G 159.02991 159.02992
3:    B 154.07688 154.07685
4:    E 156.77312 156.77310

In response to OP's comment, let's say, we have a modified referencelist2 with F = 154.00320 then this will be caught too:

setDT(referencelist2)[.(value = sample, 
                       lower = sample * (1 - eps), 
                       upper = sample * (1 + eps)), 
                     on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]

   name     value reference
1:    A 154.00315 154.00312
2:    F 154.00315 154.00320
3:    G 159.02991 159.02992
4:    B 154.07688 154.07685
5:    E 156.77312 156.77310

Matching two very very large vectors with tolerance (fast! but working space sparing)

2 Answers2

Linked