non-equi join in dplyr

Question

I use dplyr as a lot in my code and frequently have to accomplish non-equi left join on dates. I want to do it using tidy syntax and pipes. I saw a package named fuzzyjoin which accomplish this, however the client I am working with don't have this package on their server. They have a standard set of packages and getting new packages installed is difficult. For now, I have written this function, non_equi_left_join that helps me with what I am trying to accomplish:

    library(data.table)
    library(dplyr)
    library(lubridate)
      
    # lst_conditions: conditions for the left join
    non_equi_left_join <- function(df_lhs, df_rhs, lst_conditions) {
      sel_columns_lhs <- unique(unlist(map(lst_conditions, 1)))
      sel_columns_rhs <- unique(unlist(map(lst_conditions, 3)))
      
      setDT(df_lhs)[, paste0(sel_columns_lhs, "_tempJoin") := mget(sel_columns_lhs)]
      setDT(df_rhs)[, paste0(sel_columns_rhs, "_tempJoin") := mget(sel_columns_rhs)]
      
      onkey <- unlist(paste(
        paste0(map(lst_conditions, 3), "_tempJoin"),
        map(lst_conditions, 2),
        paste0(map(lst_conditions, 1), "_tempJoin")
        ))
      
      df_ret <-
        df_rhs[df_lhs, on = onkey]
      
      df_ret[, grep(".*_tempJoin$", colnames(df_ret)):=NULL]
      return(setDF(df_ret) %>% as_tibble())
      
    }

The _tempJoin part is to get rid of behavior listed in this post: fuzzyjoin two data frames using data.table

I create a copy of all the columns involved in join and get rid of them right after the join. The function helps me write the following code:

    df1 <- data.frame(
      ID = c(1L,2L,3L,3L),
      CallDate  = c(ymd("2021-01-20"), ymd("2021-02-20"), ymd("2021-02-04"), ymd("2021-04-02"))
    ) %>% as_tibble()
    
    df2 <- data.frame(
      ID = c(1L,2L,3L,3L),
      CustomerType = c("basic", "basic", "basic", "premium"),
      StartDate = c(ymd("2021-02-02"), ymd("2021-02-02"), ymd("2021-02-02"), ymd("2021-02-05")),
      EndDate   = c(ymd("2099-01-01"), ymd("2099-01-01"), ymd("2021-02-05"), ymd("2099-01-01"))
    ) %>% as_tibble()
    
    
    lst_condition <- 
      list(c("ID", "==", "ID"),
           c("CallDate", "<=", "StartDate"),
           c("CallDate", ">=", "EndDate") )
    
    df1 %>%
      non_equi_left_join(., df2, lst_condition)

I know it is crude but working fine on the examples I tested it with. I am worried about cases where it might silently give me incorrect results. Need an expert opinion here. The data I work with is usually several hundred MB data frames.

Thank you.

You would probably need to include the license and should give credit to the creator, but since `fuzzyjoin` is open source, couldn't you just copy and paste the source code for the functions you need into your "custom" code? A sort of poor man's way to install a new package. — Marcus, Mar 12 '21 at 22:11

score 0 · Answer 1 · answered Jan 30 '22 at 20:24

Why not use data.table throughout? I think the below will give you what you are looking for:

df1 <- data.table(df1)
df2 <- data.table(df2)

df2[df1, on = .(ID == ID, StartDate <= CallDate, EndDate >= CallDate), 
              .(x.ID, x.CustomerType, x.StartDate, x.EndDate, i.ID, CallDate)]

non-equi join in dplyr

1 Answers1