Find all rows from one dataframe based on date looked into the date range in other dataframe in R

Question

I have 2 data frames as mentioned below:

df1 <- data.frame(ID=c(1,1,1,2,3,4,4), actual.date=c('10/01/1996','10/02/1996','5/01/2002','7/01/1999','9/01/2005','5/01/2006','2/03/2003'),
val=c(5,10,15,20,25,30,35))
dcis <- grep('date$',names(df1))
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y')
df1

df2 <- data.frame(ID=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004'))
dcis <- grep('date$',names(df2))
df2[dcis] <- lapply(df2[dcis],as.Date,'%m/%d/%Y')
df2

Requirement -> I will start with each row of df2 and see how many rows in df1 lies within (inclusive range) date range specified in that row of df2 (grouped on ID).

For ex: for 1st row in df2, there are 2 rows in df1 (1st and 2nd) which has ID=1 and lies in date range of 1st row of df2. In the final output, I will sum 5+10 (from column 'val') and put against 1st row of df2.

This fake data doesn't seem to work, could you use dput – Bruno Dec 26 '19 at 20:02 — Bruno, Dec 26 '19 at 20:02

akrun · Answer 1 · 2019-12-26T20:33:56.603

We can use a non-equi join in data.table

library(data.table)
val1 <- setDT(df1)[df2, sum(val), on = .(ID, actual.date >= before.date, 
           actual.date <= after.date), by = .EACHI]$V1
df2$val <- val1
df2$val
#[1] 15 15 15 NA 25 35 NA NA

Or another option is a fuzzy_join

library(fuzzyjoin)
library(dplyr)
fuzzy_right_join(df1, df2, by = c("ID",
    "actual.date" = "before.date", "actual.date"= "after.date"), 
      match_fun = list(`==`, `>=`, `<=`)) %>%
 group_by(ID = ID.y, before.date, after.date) %>% 
 summarise(val = sum(val, na.rm = TRUE))
# A tibble: 8 x 4
# Groups:   ID, before.date [8]
#     ID before.date after.date   val
#  <dbl> <date>      <date>     <dbl>
#1     1 1996-10-01  1996-12-01    15
#2     1 1998-01-01  2003-09-30    15
#3     1 2000-01-01  2004-12-31    15
#4     2 2001-01-01  2006-03-31     0
#5     3 2001-01-01  2006-09-30    25
#6     4 2001-01-01  2005-09-30    35
#7     4 2004-10-01  2004-12-30     0
#8     4 2004-10-03  2004-11-28     0

score 0 · Answer 2 · answered Dec 26 '19 at 20:08

I am not sure if this is what your are after. Here is a base R solution

df2$res <- apply(df2,1, function(x) sum(df1$val[df1$ID == x["ID"] 
                                                & df1$actual.date>= x["before.date"] 
                                                & df1$actual.date<= x["after.date"]]))

such that

> df2
  ID before.date after.date res
1  1  1996-10-01 1996-12-01  15
2  1  1998-01-01 2003-09-30  15
3  1  2000-01-01 2004-12-31  15
4  2  2001-01-01 2006-03-31   0
5  3  2001-01-01 2006-09-30  25
6  4  2001-01-01 2005-09-30  35
7  4  2004-10-01 2004-12-30   0
8  4  2004-10-03 2004-11-28   0

score 0 · Answer 3 · answered Dec 26 '19 at 20:31

Maybe this a duplicate post

dplyr left_join by less than, greater than condition

df1 %>% 
  mutate(dummy=TRUE) %>%
  left_join(df2 %>% mutate(dummy=TRUE)) %>%
  filter(actual.date >= before.date, actual.date < after.date) %>%
  select(-dummy)

We also have fuzzyjoin.

library(fuzzyjoin)

fuzzy_inner_join(df1,df2,
                  by = c("actual.date" = "before.date","actual.date" = "after.date","ID" = "ID"),
                  match_fun = list(`>=`, `<=`,`==`))

Find all rows from one dataframe based on date looked into the date range in other dataframe in R

3 Answers3