R: merge based on multiple conditions (with non-equal criteria)

Question

I would like to merge 2 data frames based on multiple conditions.

DF1 <- data.frame("col1" = rep(c("A","B"), 18),
                  "col2" = rep(c("C","D","E"), 12),
                  "value"= (sample(1:100,36)),
                  "col4" = rep(NA,36))

DF2 <- data.frame("col1" = rep("A",6),
                  "col2" = rep(c("C","D"),3),
                  "data" = rep(c(1,3),3),
                  "min" = seq(0,59,by=10),
                  "max" = seq(10,69,by=10))


> DF1
   col1 col2 value col4
1     A    C    22   NA
2     B    D    58   NA
3     A    E    35   NA
4     B    C    86   NA
5     A    D    37   NA
6     B    E    16   NA
7     A    C    46   NA
8     B    D    23   NA
9     A    E    88   NA
10    B    C     3   NA
11    A    D    33   NA
12    B    E    25   NA
13    A    C    19   NA
14    B    D    24   NA
15    A    E     9   NA
16    B    C    76   NA
17    A    D    62   NA
18    B    E    68   NA
19    A    C    97   NA
20    B    D    43   NA
21    A    E     8   NA
22    B    C    84   NA
23    A    D    36   NA
24    B    E    20   NA
25    A    C    57   NA
26    B    D    99   NA
27    A    E    42   NA
28    B    C    64   NA
29    A    D    87   NA
30    B    E     1   NA
31    A    C    78   NA
32    B    D    34   NA
33    A    E    41   NA
34    B    C    32   NA
35    A    D    10   NA
36    B    E    72   NA

> DF2
  col1 col2 data min max
1    A    C    1   0  10
2    A    D    3  10  20
3    A    C    1  20  30
4    A    D    3  30  40
5    A    C    1  40  50
6    A    D    3  50  60

DF1 is the main table and DF2 is treated as a lookup table.

If col1 and col2 of DF1 match those of DF2, and 'value' of DF1 is between min and max of DF2, then column 'data' from DF2 should be added to DF1. If the conditions are not met, 'data' should be NA for that row of DF1.

Expected output (first 6 rows):

  col1 col2 value col4 data
1    A    C    22   NA    1
2    B    D    58   NA   NA
3    A    E    35   NA   NA
4    B    C    86   NA   NA
5    A    D    37   NA    3
6    B    E    16   NA   NA

I've tried using merge (to match col1 and col2) and then subset (to keep only rows whose value is between min and max), but my goal is to keep all the rows of DF1.

Does anyone have an idea how to do this?

Possible duplicate of [Complexe non-equi merge in R](https://stackoverflow.com/questions/41043047/complexe-non-equi-merge-in-r) — Luiz Rodrigo, Jul 27 '17 at 12:47

score 6 · Answer 1 · edited Jun 20 '20 at 09:12

With recent versions of data.table, non-equi joins and update on join are possible:

library(data.table)
head(setDT(DF1)[setDT(DF2), on = c("col1", "col2", "value>=min", "value<=max"), 
                data := data])

   rn col1 col2 value col4 data
1:  1    A    C    22   NA    1
2:  2    B    D    58   NA   NA
3:  3    A    E    35   NA   NA
4:  4    B    C    86   NA   NA
5:  5    A    D    37   NA    3
6:  6    B    E    16   NA   NA
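
A minimal sketch of the same update join on a small hand-made example, so the result can be checked by eye. The tables `main` and `lookup` and their values are illustrative assumptions, not the question's data. Writing `i.data` makes explicit that the value is taken from the lookup table (the `i` argument).

```r
library(data.table)

# Hypothetical miniature versions of DF1 and DF2
main <- data.table(
  col1  = c("A", "B", "A", "A"),
  col2  = c("C", "D", "D", "C"),
  value = c(22L, 58L, 37L, 57L)
)
lookup <- data.table(
  col1 = "A",
  col2 = c("C", "D", "C", "D"),
  data = c(1L, 3L, 1L, 3L),
  min  = c(20L, 30L, 40L, 50L),
  max  = c(30L, 40L, 50L, 60L)
)

# Update join: rows of main that match a lookup row on col1, col2 and
# min <= value <= max receive that row's data; all other rows get NA.
main[lookup, on = .(col1, col2, value >= min, value <= max), data := i.data]

print(main)
```

Because the join updates `main` by reference, its row count and row order stay the same; the last row (A, C, 57) matches the keys but no range, so its `data` stays NA.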

Data

DF1 <- structure(list(rn = 1:36, col1 = c("A", "B", "A", "B", "A", "B", 
"A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", 
"B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", 
"A", "B", "A", "B"), col2 = c("C", "D", "E", "C", "D", "E", "C", 
"D", "E", "C", "D", "E", "C", "D", "E", "C", "D", "E", "C", "D", 
"E", "C", "D", "E", "C", "D", "E", "C", "D", "E", "C", "D", "E", 
"C", "D", "E"), value = c(22L, 58L, 35L, 86L, 37L, 16L, 46L, 
23L, 88L, 3L, 33L, 25L, 19L, 24L, 9L, 76L, 62L, 68L, 97L, 43L, 
8L, 84L, 36L, 20L, 57L, 99L, 42L, 64L, 87L, 1L, 78L, 34L, 41L, 
32L, 10L, 72L), col4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("rn", 
"col1", "col2", "value", "col4"), row.names = c(NA, -36L), class = "data.frame")
DF2 <- structure(list(rn = 1:6, col1 = c("A", "A", "A", "A", "A", "A"
), col2 = c("C", "D", "C", "D", "C", "D"), data = c(1L, 3L, 1L, 
3L, 1L, 3L), min = c(0L, 10L, 20L, 30L, 40L, 50L), max = c(10L, 
20L, 30L, 40L, 50L, 60L)), .Names = c("rn", "col1", "col2", "data", 
"min", "max"), row.names = c(NA, -6L), class = "data.frame")

score 3 · Accepted Answer · answered Jul 27 '17 at 13:21

Your data, with stringsAsFactors = FALSE so that col1 and col2 are character rather than factor:

DF1 <- data.frame("col1" = rep(c("A","B"), 18),
                  "col2" = rep(c("C","D","E"), 12),
                  "value"= (sample(1:100,36)),
                  "col4" = rep(NA,36),
                  stringsAsFactors = FALSE)

DF2 <- data.frame("col1" = rep("A",6),
                  "col2" = rep(c("C","D"),3),
                  "data" = rep(c(1,3),3),
                  "min" = seq(0,59,by=10),
                  "max" = seq(10,69,by=10),
                  stringsAsFactors = FALSE)

Using dplyr: 1) join the two data frames with left_join, 2) row by row, keep data only where value is between min and max (otherwise set it to NA), then 3) drop the min and max columns.

library(dplyr)
left_join(DF1, DF2, by=c("col1","col2")) %>%
  rowwise() %>%
  mutate(data = ifelse(between(value,min,max), data, NA)) %>%
  select(-min, -max)

I'm not sure whether you expected some kind of aggregation, but here is the output of the code above (each DF1 row is repeated once per DF2 row with the same col1 and col2):

    col1  col2 value  col4  data
 1     A     C    23    NA    NA
 2     A     C    23    NA     1
 3     A     C    23    NA    NA
 4     B     D    59    NA    NA
 5     A     E    57    NA    NA
 6     B     C     8    NA    NA
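
To get back to one row per row of the main table, one option is to tag each row with an id before the join and collapse afterwards. This is only a sketch under assumptions: the `id` column and the small `main` and `lookup` tables are hypothetical, and it assumes that at most one range matches each row.

```r
library(dplyr)

# Hypothetical miniature versions of DF1 and DF2
main <- data.frame(
  col1  = c("A", "B", "A", "A"),
  col2  = c("C", "D", "D", "E"),
  value = c(22, 58, 45, 35),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  col1 = "A",
  col2 = c("C", "D", "C", "D"),
  data = c(1, 3, 1, 3),
  min  = c(20, 30, 40, 50),
  max  = c(30, 40, 50, 60),
  stringsAsFactors = FALSE
)

collapsed <- main %>%
  mutate(id = row_number()) %>%                 # remember the original row
  left_join(lookup, by = c("col1", "col2")) %>% # one row per candidate range
  mutate(data = ifelse(!is.na(min) & value >= min & value <= max, data, NA)) %>%
  group_by(id, col1, col2, value) %>%
  # keep the first non-NA data per original row, or NA if there is none
  summarise(data = data[!is.na(data)][1], .groups = "drop") %>%
  select(-id)

print(collapsed)
```

On dplyr 1.1.0 or later, `left_join` may warn about a many-to-many relationship when the keys repeat on both sides, as they do in the question's DF1 and DF2; the result is unaffected.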

score 1 · Answer 3 · answered Mar 03 '19 at 23:27

Using my package safejoin, which wraps fuzzyjoin functions, you can do:

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)

safe_left_join(DF1, DF2,  ~
                  X("col1") == Y("col1") & 
                  X("col2") == Y("col2") & 
                  X("value") >= Y("min") &
                  X("value") <= Y("max"),
               conflict = ~.x) %>% 
  head(15)
#    col1 col2 value col4 data min max
# 1     A    C    90   NA   NA  NA  NA
# 2     B    D    20   NA   NA  NA  NA
# 3     A    E     8   NA   NA  NA  NA
# 4     B    C    99   NA   NA  NA  NA
# 5     A    D    42   NA   NA  NA  NA
# 6     B    E    37   NA   NA  NA  NA
# 7     A    C    47   NA    1  40  50
# 8     B    D    61   NA   NA  NA  NA
# 9     A    E    55   NA   NA  NA  NA
# 10    B    C    11   NA   NA  NA  NA
# 11    A    D    81   NA   NA  NA  NA
# 12    B    E    48   NA   NA  NA  NA
# 13    A    C    77   NA   NA  NA  NA
# 14    B    D    58   NA   NA  NA  NA
# 15    A    E     3   NA   NA  NA  NA

The conflict argument here tells the function to keep the left-hand side values for the columns that exist in both tables (col1 and col2).

score 0 · Answer 4 · answered Jul 27 '17 at 13:30

You can do it in two steps:

final <- merge(DF1, DF2, by = c("col1", "col2"), all.x = TRUE)
# Compare value (not data) against the range; use NA rather than the string "NULL"
final$data <- ifelse(final$value >= final$min & final$value <= final$max, final$data, NA)

Answered by anarchy.

score 0 · Answer 5 · answered Mar 23 '20 at 07:20

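Using all.x=TRUE keeps the rows of DF1 that have no col1/col2 match in DF2. The filter then keeps those unmatched rows together with the rows whose value lies between min and max. Note that a DF1 row whose col1 and col2 match DF2 but whose value is outside every range is dropped: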

iMed=merge(DF1,DF2,by.x=c('col1','col2'),by.y=c('col1','col2'),all.x=TRUE)
Res=iMed[is.na(iMed[,'min'])|is.na(iMed[,'max'])|(iMed[,'value']<=iMed[,'max'] & iMed[,'value']>=iMed[,'min'] ),]
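
A minimal base R sketch that keeps every row of the main table, including rows whose keys match but whose value lies outside every range. The `id` helper column and the small `main` and `lookup` tables are hypothetical, introduced only for illustration; the idea is to find the hits first and then merge them back by row id.

```r
# Hypothetical miniature versions of DF1 and DF2
main <- data.frame(
  col1  = c("A", "B", "A", "A"),
  col2  = c("C", "D", "D", "C"),
  value = c(22, 58, 37, 57),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  col1 = "A",
  col2 = c("C", "D", "C", "D"),
  data = c(1, 3, 1, 3),
  min  = c(20, 30, 40, 50),
  max  = c(30, 40, 50, 60),
  stringsAsFactors = FALSE
)

main$id <- seq_len(nrow(main))  # remember the original row

# Inner join on the keys, then keep only rows inside a range
joined <- merge(main, lookup, by = c("col1", "col2"))
hits <- joined[joined$value >= joined$min & joined$value <= joined$max,
               c("id", "data")]

# Merge the hits back; all.x = TRUE keeps rows of main without a hit
result <- merge(main, hits, by = "id", all.x = TRUE)
result <- result[order(result$id), c("col1", "col2", "value", "data")]

print(result)
```

The last row (A, C, 57) has matching keys but no matching range, and it is still present in `result` with `data` equal to NA.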
