I use dplyr as a lot in my code and frequently have to accomplish non-equi left join on dates. I want to do it using tidy syntax and pipes. I saw a package named fuzzyjoin which accomplish this, however the client I am working with don't have this package on their server. They have a standard set of packages and getting new packages installed is difficult. For now, I have written this function, non_equi_left_join that helps me with what I am trying to accomplish:
library(data.table)
library(dplyr)
library(lubridate)
# lst_conditions: conditions for the left join
non_equi_left_join <- function(df_lhs, df_rhs, lst_conditions) {
sel_columns_lhs <- unique(unlist(map(lst_conditions, 1)))
sel_columns_rhs <- unique(unlist(map(lst_conditions, 3)))
setDT(df_lhs)[, paste0(sel_columns_lhs, "_tempJoin") := mget(sel_columns_lhs)]
setDT(df_rhs)[, paste0(sel_columns_rhs, "_tempJoin") := mget(sel_columns_rhs)]
onkey <- unlist(paste(
paste0(map(lst_conditions, 3), "_tempJoin"),
map(lst_conditions, 2),
paste0(map(lst_conditions, 1), "_tempJoin")
))
df_ret <-
df_rhs[df_lhs, on = onkey]
df_ret[, grep(".*_tempJoin$", colnames(df_ret)):=NULL]
return(setDF(df_ret) %>% as_tibble())
}
The _tempJoin part is to get rid of behavior listed in this post: fuzzyjoin two data frames using data.table
I create a copy of all the columns involved in join and get rid of them right after the join. The function helps me write the following code:
df1 <- data.frame(
ID = c(1L,2L,3L,3L),
CallDate = c(ymd("2021-01-20"), ymd("2021-02-20"), ymd("2021-02-04"), ymd("2021-04-02"))
) %>% as_tibble()
df2 <- data.frame(
ID = c(1L,2L,3L,3L),
CustomerType = c("basic", "basic", "basic", "premium"),
StartDate = c(ymd("2021-02-02"), ymd("2021-02-02"), ymd("2021-02-02"), ymd("2021-02-05")),
EndDate = c(ymd("2099-01-01"), ymd("2099-01-01"), ymd("2021-02-05"), ymd("2099-01-01"))
) %>% as_tibble()
lst_condition <-
list(c("ID", "==", "ID"),
c("CallDate", "<=", "StartDate"),
c("CallDate", ">=", "EndDate") )
df1 %>%
non_equi_left_join(., df2, lst_condition)
I know it is crude but working fine on the examples I tested it with. I am worried about cases where it might silently give me incorrect results. Need an expert opinion here. The data I work with is usually several hundred MB data frames.
Thank you.