1

I have 2 data frames as mentioned below:

df1 <- data.frame(ID=c(1,1,1,2,3,4,4), actual.date=c('10/01/1996','10/02/1996','5/01/2002','7/01/1999','9/01/2005','5/01/2006','2/03/2003'),
val=c(5,10,15,20,25,30,35))
dcis <- grep('date$',names(df1))
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y')
df1

df2 <- data.frame(ID=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004'))
dcis <- grep('date$',names(df2))
df2[dcis] <- lapply(df2[dcis],as.Date,'%m/%d/%Y')
df2

Requirement -> I will start with each row of df2 and see how many rows in df1 lies within (inclusive range) date range specified in that row of df2 (grouped on ID).

For ex: for 1st row in df2, there are 2 rows in df1 (1st and 2nd) which has ID=1 and lies in date range of 1st row of df2. In the final output, I will sum 5+10 (from column 'val') and put against 1st row of df2.

akrun
  • 874,273
  • 37
  • 540
  • 662
Prateek
  • 61
  • 6

3 Answers3

1

We can use a non-equi join in data.table

library(data.table)
val1 <- setDT(df1)[df2, sum(val), on = .(ID, actual.date >= before.date, 
           actual.date <= after.date), by = .EACHI]$V1
df2$val <- val1
df2$val
#[1] 15 15 15 NA 25 35 NA NA

Or another option is a fuzzy_join

library(fuzzyjoin)
library(dplyr)
fuzzy_right_join(df1, df2, by = c("ID",
    "actual.date" = "before.date", "actual.date"= "after.date"), 
      match_fun = list(`==`, `>=`, `<=`)) %>%
 group_by(ID = ID.y, before.date, after.date) %>% 
 summarise(val = sum(val, na.rm = TRUE))
# A tibble: 8 x 4
# Groups:   ID, before.date [8]
#     ID before.date after.date   val
#  <dbl> <date>      <date>     <dbl>
#1     1 1996-10-01  1996-12-01    15
#2     1 1998-01-01  2003-09-30    15
#3     1 2000-01-01  2004-12-31    15
#4     2 2001-01-01  2006-03-31     0
#5     3 2001-01-01  2006-09-30    25
#6     4 2001-01-01  2005-09-30    35
#7     4 2004-10-01  2004-12-30     0
#8     4 2004-10-03  2004-11-28     0
akrun
  • 874,273
  • 37
  • 540
  • 662
0

I am not sure if this is what your are after. Here is a base R solution

df2$res <- apply(df2,1, function(x) sum(df1$val[df1$ID == x["ID"] 
                                                & df1$actual.date>= x["before.date"] 
                                                & df1$actual.date<= x["after.date"]]))

such that

> df2
  ID before.date after.date res
1  1  1996-10-01 1996-12-01  15
2  1  1998-01-01 2003-09-30  15
3  1  2000-01-01 2004-12-31  15
4  2  2001-01-01 2006-03-31   0
5  3  2001-01-01 2006-09-30  25
6  4  2001-01-01 2005-09-30  35
7  4  2004-10-01 2004-12-30   0
8  4  2004-10-03 2004-11-28   0
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
0

Maybe this a duplicate post

dplyr left_join by less than, greater than condition

df1 %>% 
  mutate(dummy=TRUE) %>%
  left_join(df2 %>% mutate(dummy=TRUE)) %>%
  filter(actual.date >= before.date, actual.date < after.date) %>%
  select(-dummy)

We also have fuzzyjoin.

library(fuzzyjoin)

fuzzy_inner_join(df1,df2,
                  by = c("actual.date" = "before.date","actual.date" = "after.date","ID" = "ID"),
                  match_fun = list(`>=`, `<=`,`==`))
Bruno
  • 4,109
  • 1
  • 9
  • 27