Join two data.tables by one numeric variable with dynamic tolerance

Question

I have two data.tables which I would like to join via one numeric variable (double precision). However, the numeric variable is flawed with an uncertainty. Therefor I have to allow for a specific tolerance which is different depending on the variable.

In the example below "mz" is the variable by which I would like to join DT1 and DT2. The tolerance calculated from the variabe iso_mz: iso_mz * 5e-6.

DT1 <- data.table(mz = c(433.231512451172, 451.091953822545, 454.347605202415, 490.167234693255, 518.225894504123), 
Var1 = c(433.231018066406, 451.091430664062, 454.347015380859, 490.166381835938, 518.22509765625), 
Var2 = c(433.232147216797, 451.092559814453, 454.34814453125, 490.168273925781, 518.2265625))


DT2 <- data.table(iso_mz = c(451.0900, 490.1651, 518.2281, 433.2335), 
comp = c("m1", "m2", "m3", "m4"))

If I would not have to use a tolerance I would use the "on=.()" functionality of the data.table package. I tried to adapt the code from Joining Data Frames by Measured Values with an Error Range but for some reason I could not get in running,..

The desired output for my examlpe would be:

Output <- data.table(
iso_mz = c(433.2335, 451.0900, 490.1651, 518.2281), 
comp = c("m4", "m1", "m2", "m3"),
mz = c(433.231512451172, 451.091953822545, 490.167234693255, 518.225894504123), 
Var1 = c(433.231018066406, 451.091430664062, 490.166381835938, 518.22509765625), 
Var2 = c(433.232147216797, 451.092559814453, 490.168273925781, 518.2265625))

Thank you in advance!

Wimpel · Accepted Answer · 2019-07-09T12:23:44.170

Here is an approach using foverlaps() from data.table.

tolerance = 5e-6
#create ranges to join on
DT1[, `:=`(min = mz - mz * tolerance, 
           max = mz + mz * tolerance) ]
DT2[, `:=`(min = iso_mz - iso_mz * tolerance, 
           max = iso_mz + iso_mz * tolerance) ]
#set keys
setkey(DT1, min, max )
setkey(DT2, min, max )
#perform overlap join, order, remove min-max columns
ans <- setorder( foverlaps( DT2, DT1 ), mz)[, `:=`(min=NULL,max=NULL,i.min=NULL,i.max=NULL)][]


# mz     Var1     Var2   iso_mz comp
# 1: 433.2315 433.2310 433.2321 433.2335   m4
# 2: 451.0920 451.0914 451.0926 451.0900   m1
# 3: 490.1672 490.1664 490.1683 490.1651   m2
# 4: 518.2259 518.2251 518.2266 518.2281   m3


#check
all.equal( setcolorder(ans, names(Output)), Output )
[1] TRUE

Thank you for your quick response. Sorry for the cofusion around the mzmin/mzmax variables. They do not indicate the tolerances but are just aditional variables I would like to keep in those rows, which is why I left them in the example. `433.2335 - 433.2335*5e-6 ` [1] 433.2313 `433.2335 + 433.2335*5e-6 ` [1] 433.2357 Hence: 433.2315 is in the range — yasel, Jul 09 '19 at 12:19

Join two data.tables by one numeric variable with dynamic tolerance

1 Answers1