You need a "non-equi" or "range" join. This is implemented in fuzzyjoin
and data.table
packages for R. Since it is also supported in SQL, one can also use sqldf
.
Sadly, dplyr
does not support this natively. Since this action is supported in SQL, if your data are in a database then dbplyr
would allow it using its sql_on
, but not natively.
First, we need to add in the 20% tolerance:
df_1$age_1_start <- df_1$age_1 * 0.8
df_1$age_1_end <- df_1$age_1 * 1.2
df_1
# shift_1 level_1 age_1 length_1 age_1_start age_1_end
# 1 1 1 4.5 100 3.60 5.40
# 2 1 3 3.2 120 2.56 3.84
# 3 0 5 3.0 5 2.40 3.60
# 4 2 4 2.5 70 2.00 3.00
fuzzyjoin
fuzzyjoin::fuzzy_left_join(
df_1, df_2,
by = c("shift_1" = "shift_2", "level_1" = "level_2",
"age_1_start" = "age_2", "age_1_end" = "age_2"),
match_fun = list(`==`, `==`, `<=`, `>=`))
# shift_1 level_1 age_1 length_1 age_1_start age_1_end shift_2 level_2 age_2 length_2
# 1 1 1 4.5 100 3.60 5.40 NA NA NA NA
# 2 1 3 3.2 120 2.56 3.84 1 3 3.1 180
# 3 0 5 3.0 5 2.40 3.60 NA NA NA NA
# 4 2 4 2.5 70 2.00 3.00 2 4 2.2 40
data.table
library(data.table)
DT_1 <- as.data.table(df_1) # must include age_1_start and age_1_end from above
DT_2 <- as.data.table(df_2)
DT_2[DT_1, on = .(shift_2 == shift_1, level_2 == level_1, age_2 >= age_1_start, age_2 <= age_1_end)]
# shift_2 level_2 age_2 length_2 age_2.1 age_1 length_1
# 1: 1 1 3.60 NA 5.40 4.5 100
# 2: 1 3 2.56 180 3.84 3.2 120
# 3: 0 5 2.40 NA 3.60 3.0 5
# 4: 2 4 2.00 40 3.00 2.5 70
This package tends to rename the left (DT_1
) join based on the right's names, which may be frustrating. For this, you will need to do some cleanup afterwards.
sqldf
sqldf::sqldf(
"select t1.*, t2.*
from df_1 t1
left join df_2 t2 on t1.shift_1 = t2.shift_2 and t1.level_1 = t2.level_2
and t1.age_1_start <= t2.age_2 and t1.age_1_end >= t2.age_2")
# shift_1 level_1 age_1 length_1 age_1_start age_1_end shift_2 level_2 age_2 length_2
# 1 1 1 4.5 100 3.60 5.40 NA NA NA NA
# 2 1 3 3.2 120 2.56 3.84 1 3 3.1 180
# 3 0 5 3.0 5 2.40 3.60 NA NA NA NA
# 4 2 4 2.5 70 2.00 3.00 2 4 2.2 40
If you know SQL, then the last might be the most intuitive and easiest to absorb. Keep in mind, though, that for larger frames, it is copying the entire frame into a memory-storage SQLite instance ... which is not "free".
The fuzzyjoin
implementation gives you a lot of power, and its arguments seem (to me) to be easy to follow. The results are named as I would expect. However, it is the slowest (with this data) of the three implementations. (This should only be a concern if your real data is "very" large.)
If you don't already know data.table
, despite its blazing speed, its dialect of R can be a bit obscure to the uninformed. I believe it has as much power as fuzzyjoin
, though I haven't tested all corner-cases to see if one supports something the other does not.
bench::mark(
fuzzyjoin = fuzzyjoin::fuzzy_left_join(
df_1, df_2,
by = c("shift_1" = "shift_2", "level_1" = "level_2",
"age_1_start" = "age_2", "age_1_end" = "age_2"),
match_fun = list(`==`, `==`, `<=`, `>=`)),
data.table = DT_2[DT_1, on = .(shift_2 == shift_1, level_2 == level_1, age_2 >= age_1_start, age_2 <= age_1_end)],
sqldf = sqldf::sqldf(
"select t1.*, t2.*
from df_1 t1
left join df_2 t2 on t1.shift_1 = t2.shift_2 and t1.level_1 = t2.level_2
and t1.age_1_start <= t2.age_2 and t1.age_1_end >= t2.age_2"),
check = FALSE
)
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 fuzzyjoin 134.12ms 143.24ms 6.98 107KB 6.98 2 2 286ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
# 2 data.table 2.14ms 2.63ms 335. 114KB 2.06 163 1 487ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
# 3 sqldf 21.14ms 22.72ms 42.9 184KB 4.52 19 2 442ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~