2

I have two data frames and for each specific row in data frame one, I am trying to find its closest match from data frame two based on defined important criteria (shift, age, level). So as an example, imagine I have data frame 1:

shift_1 <- c(1,1,0,2)
length_1 <- c(100,120,5,70)
level_1<- c(1,3,5,4)
age_1 <- c(4.5,3.2,3,2.5)

df_1 <- data.frame(shift_1,level_1,age_1,length_1)

  shift_1 level_1 age_1 length_1
1       1       1   4.5      100
2       1       3   3.2      120
3       0       5   3.0        5
4       2       4   2.5       70

and I for each row of this data frame, I want to find the closest match in data frame 2 which is shown below:

shift_2 <- c(1,1,2,1,0)
length_2 <- c(100,200,40,180,10)
level_2<- c(3,4,4,3,5)
age_2 <- c(2.5,5.5,2.2,3.1,5)

df_2 <- data.frame(shift_2,level_2,age_2,length_2)

  shift_2 level_2 age_2 length_2
1       1       3   2.5      100
2       1       4   5.5      200
3       2       4   2.2       40
4       1       3   3.1      180
5       0       5   5.0       10

based on these criteria : shift must match exactly. Level must match exactly. age will be accepted with 20% difference.

if we find a match : we want to add index number of the matched row and match information, otherwise we will put NA. so the expected result is in this form:

  shift_r level_r age_r length_r index shift_match level_match age_match length_match
1       1       1   4.5      100    NA          NA          NA        NA           NA
2       1       3   3.2      120     4           1           3       3.1          180
3       0       5   3.0        5    NA          NA          NA        NA           NA
4       2       4   2.5       70     3           2           4       2.2           40

Can you please advise how I should approach this? is there any library to make this task easier?

Ross_you
  • 881
  • 5
  • 22

1 Answers1

2

You need a "non-equi" or "range" join. This is implemented in fuzzyjoin and data.table packages for R. Since it is also supported in SQL, one can also use sqldf.

Sadly, dplyr does not support this natively. Since this action is supported in SQL, if your data are in a database then dbplyr would allow it using its sql_on, but not natively.

First, we need to add in the 20% tolerance:

df_1$age_1_start <- df_1$age_1 * 0.8
df_1$age_1_end <- df_1$age_1 * 1.2
df_1
#   shift_1 level_1 age_1 length_1 age_1_start age_1_end
# 1       1       1   4.5      100        3.60      5.40
# 2       1       3   3.2      120        2.56      3.84
# 3       0       5   3.0        5        2.40      3.60
# 4       2       4   2.5       70        2.00      3.00

fuzzyjoin

fuzzyjoin::fuzzy_left_join(
  df_1, df_2,
  by = c("shift_1" = "shift_2", "level_1" = "level_2",
         "age_1_start" = "age_2", "age_1_end" = "age_2"),
  match_fun = list(`==`, `==`, `<=`, `>=`))
#   shift_1 level_1 age_1 length_1 age_1_start age_1_end shift_2 level_2 age_2 length_2
# 1       1       1   4.5      100        3.60      5.40      NA      NA    NA       NA
# 2       1       3   3.2      120        2.56      3.84       1       3   3.1      180
# 3       0       5   3.0        5        2.40      3.60      NA      NA    NA       NA
# 4       2       4   2.5       70        2.00      3.00       2       4   2.2       40

data.table

library(data.table)
DT_1 <- as.data.table(df_1) # must include age_1_start and age_1_end from above
DT_2 <- as.data.table(df_2)

DT_2[DT_1, on = .(shift_2 == shift_1, level_2 == level_1, age_2 >= age_1_start, age_2 <= age_1_end)]
#    shift_2 level_2 age_2 length_2 age_2.1 age_1 length_1
# 1:       1       1  3.60       NA    5.40   4.5      100
# 2:       1       3  2.56      180    3.84   3.2      120
# 3:       0       5  2.40       NA    3.60   3.0        5
# 4:       2       4  2.00       40    3.00   2.5       70

This package tends to rename the left (DT_1) join based on the right's names, which may be frustrating. For this, you will need to do some cleanup afterwards.

sqldf

sqldf::sqldf(
  "select t1.*, t2.*
   from df_1 t1
     left join df_2 t2 on t1.shift_1 = t2.shift_2 and t1.level_1 = t2.level_2
       and t1.age_1_start <= t2.age_2 and t1.age_1_end >= t2.age_2")
#   shift_1 level_1 age_1 length_1 age_1_start age_1_end shift_2 level_2 age_2 length_2
# 1       1       1   4.5      100        3.60      5.40      NA      NA    NA       NA
# 2       1       3   3.2      120        2.56      3.84       1       3   3.1      180
# 3       0       5   3.0        5        2.40      3.60      NA      NA    NA       NA
# 4       2       4   2.5       70        2.00      3.00       2       4   2.2       40

If you know SQL, then the last might be the most intuitive and easiest to absorb. Keep in mind, though, that for larger frames, it is copying the entire frame into a memory-storage SQLite instance ... which is not "free".

The fuzzyjoin implementation gives you a lot of power, and its arguments seem (to me) to be easy to follow. The results are named as I would expect. However, it is the slowest (with this data) of the three implementations. (This should only be a concern if your real data is "very" large.)

If you don't already know data.table, despite its blazing speed, its dialect of R can be a bit obscure to the uninformed. I believe it has as much power as fuzzyjoin, though I haven't tested all corner-cases to see if one supports something the other does not.

bench::mark(
  fuzzyjoin = fuzzyjoin::fuzzy_left_join(
    df_1, df_2,
    by = c("shift_1" = "shift_2", "level_1" = "level_2",
           "age_1_start" = "age_2", "age_1_end" = "age_2"),
    match_fun = list(`==`, `==`, `<=`, `>=`)),
  data.table = DT_2[DT_1, on = .(shift_2 == shift_1, level_2 == level_1, age_2 >= age_1_start, age_2 <= age_1_end)],
  sqldf = sqldf::sqldf(
    "select t1.*, t2.*
     from df_1 t1
       left join df_2 t2 on t1.shift_1 = t2.shift_2 and t1.level_1 = t2.level_2
         and t1.age_1_start <= t2.age_2 and t1.age_1_end >= t2.age_2"),
  check = FALSE
)
# # A tibble: 3 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory        time      gc       
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>        <list>    <list>   
# 1 fuzzyjoin  134.12ms 143.24ms      6.98     107KB     6.98     2     2      286ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
# 2 data.table   2.14ms   2.63ms    335.       114KB     2.06   163     1      487ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
# 3 sqldf       21.14ms  22.72ms     42.9      184KB     4.52    19     2      442ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • wow! thanks so much r2evans for your answers. Since the "fuzzy join" seems to be easier for me to understand, I would probably stick to that one. Just one question regarding fuzzy join. Is there any reason you didn't include "shift" in your code? Since "shift" was one of the criteria . Also, how does "match_fun = list(`==`, `<=`, `>=`)" work? you defined all columns with "=" and then assigned a list of logical expressions in order? – Ross_you Oct 15 '20 at 00:01