Join data table with multiple conditions

Question

I have this two table

library(data.table)
n = 100
set.seed(12)
d = runif(n)
dtBig = data.table(id = rep(letters[1:2],each = n/2),
                   d1 = d,
                   d2 = d + runif(n))
dtSmall = data.table(id = rep(letters[1:2],each = 2),
                     d_start = c(0.2,0.65,0.15,1.1),
                     d_end = c(0.65,0.85,0.8,1.5))

I would like to do a merge with two conditions on d1 and d2.

I tried that

dtAll = merge(dtSmall, dtBig, by = "id", allow.cartesian = T)
dtAll = dtAll[d1 >= d_start & d2 <= d_end]

However these command create a big table with all possible merge. In this case the tables are small so it's ok but in my case I have so many rows. So I look into the on join like this

dtAll2 = dtSmall[,dtBig[.SD, on = .(id,d1>=d_start)]]
dtAll2 = dtAll[d2 <= d_end]

However, only one condition is allowed with on.

Do you know a command in one line (without saving the big table) ?

Thank for your help.

Try `dtBig[dtSmall, on = .(id, d1 >= d_start, d2 <= d_end)]` — akrun, Nov 24 '22 at 20:08
Related, possible duplicate: https://stackoverflow.com/q/27487949/680068 — zx8754, Nov 24 '22 at 20:30
"*only one condition is allowed with on.*" - I'm confused - you have two conditions in your first join line as it is! — thelatemail, Nov 24 '22 at 21:58
Thanks @akrun, I use your suggestion but I still have some issues with columns. I made a new post for these new issues https://stackoverflow.com/questions/74737517/data-table-join-with-conditions-keep-all-columns?noredirect=1#comment131904998_74737517. — Lolivano, Dec 09 '22 at 00:11

AndS. · Answer 1 · 2022-11-24T20:17:21.110

Could you map it out and them combine?

library(tidyverse)

pmap_dfr(list(dtSmall$id, dtSmall$d_start, dtSmall$d_end), 
         ~filter(dtBig, id == ..1 & d1 >= ..2 & d2 <= ..3) |>
           mutate(d_start = ..2, d_end = ..3))
#>     id        d1        d2 d_start d_end
#>  1:  a 0.3808122 0.4641550    0.20  0.65
#>  2:  a 0.2649184 0.3592956    0.20  0.65
#>  3:  a 0.6332646 0.6490715    0.20  0.65
#>  4:  b 0.3945864 0.6576796    0.15  0.80
#>  5:  b 0.4210900 0.5229448    0.15  0.80
#>  6:  b 0.3220115 0.4054827    0.15  0.80
#>  7:  b 0.5741449 0.6406442    0.15  0.80
#>  8:  b 0.2942439 0.7141591    0.15  0.80
#>  9:  b 0.1595905 0.6743042    0.15  0.80
#> 10:  b 0.1575806 0.3252544    0.15  0.80
#> 11:  b 0.2427686 0.5242777    0.15  0.80
#> 12:  b 0.2621347 0.7723904    0.15  0.80
#> 13:  b 0.3664409 0.4148807    0.15  0.80
#> 14:  b 0.4951412 0.7143280    0.15  0.80
#> 15:  b 0.3111479 0.4680435    0.15  0.80
#> 16:  b 0.3437159 0.5852995    0.15  0.80

you also could do a fuzzy join:

library(fuzzyjoin)

fuzzy_join(dtSmall, 
           dtBig, 
           by = c("id", "d_start" = "d1", "d_end" = "d2"),
           match_fun = c(`==`, `<=`, `>=`))
#>    id.x d_start d_end id.y        d1        d2
#> 1     a    0.20  0.65    a 0.3808122 0.4641550
#> 2     a    0.20  0.65    a 0.2649184 0.3592956
#> 3     a    0.20  0.65    a 0.6332646 0.6490715
#> 4     b    0.15  0.80    b 0.3945864 0.6576796
#> 5     b    0.15  0.80    b 0.4210900 0.5229448
#> 6     b    0.15  0.80    b 0.3220115 0.4054827
#> 7     b    0.15  0.80    b 0.5741449 0.6406442
#> 8     b    0.15  0.80    b 0.2942439 0.7141591
#> 9     b    0.15  0.80    b 0.1595905 0.6743042
#> 10    b    0.15  0.80    b 0.1575806 0.3252544
#> 11    b    0.15  0.80    b 0.2427686 0.5242777
#> 12    b    0.15  0.80    b 0.2621347 0.7723904
#> 13    b    0.15  0.80    b 0.3664409 0.4148807
#> 14    b    0.15  0.80    b 0.4951412 0.7143280
#> 15    b    0.15  0.80    b 0.3111479 0.4680435
#> 16    b    0.15  0.80    b 0.3437159 0.5852995

Join data table with multiple conditions

1 Answers1