I'm trying to aggregate two data frames (df1 and df2).

The first contains 3 variables: ID, Date1 and Date2.

df1

ID      Date1      Date2
 1 2016-03-01 2016-04-01
 1 2016-04-01 2016-05-01
 2 2016-03-14 2016-04-15
 2 2016-04-15 2016-05-17
 3 2016-05-01 2016-06-10
 3 2016-06-10 2016-07-15

The second also contains 3 variables: ID, Date3 and Value.

df2

ID      Date3 Value
 1 2016-03-15     5
 1 2016-04-04     7
 1 2016-04-28     7
 2 2016-03-18     3
 2 2016-03-27     5
 2 2016-04-08     9
 2 2016-04-20     2
 3 2016-05-05     6
 3 2016-05-25     8
 3 2016-06-13     3

The idea is to get, for each row of df1, the sum of the df2$Value entries that have the same ID and whose Date3 falls between Date1 and Date2:

ID      Date1      Date2 SumValue
 1 2016-03-01 2016-04-01        5
 1 2016-04-01 2016-05-01       14
 2 2016-03-14 2016-04-15       17
 2 2016-04-15 2016-05-17        2
 3 2016-05-01 2016-06-10       14
 3 2016-06-10 2016-07-15        3
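
For anyone who wants to reproduce this, here is a rough sketch of the two frames as R code, together with the slow row-by-row loop as a reference result to check faster solutions against. It assumes the date columns are of class Date and that both interval bounds are inclusive (the sample data has no Date3 exactly on a boundary, so either reading gives the same sums).

```r
# Rebuild the sample data shown above (dates assumed to be Date objects)
df1 <- data.frame(
  ID    = c(1L, 1L, 2L, 2L, 3L, 3L),
  Date1 = as.Date(c("2016-03-01", "2016-04-01", "2016-03-14",
                    "2016-04-15", "2016-05-01", "2016-06-10")),
  Date2 = as.Date(c("2016-04-01", "2016-05-01", "2016-04-15",
                    "2016-05-17", "2016-06-10", "2016-07-15"))
)
df2 <- data.frame(
  ID    = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  Date3 = as.Date(c("2016-03-15", "2016-04-04", "2016-04-28", "2016-03-18",
                    "2016-03-27", "2016-04-08", "2016-04-20", "2016-05-05",
                    "2016-05-25", "2016-06-13")),
  Value = c(5L, 7L, 7L, 3L, 5L, 9L, 2L, 6L, 8L, 3L)
)

# Slow reference: one pass over df2 for every row of df1
SumValue <- integer(nrow(df1))
for (r in seq_len(nrow(df1))) {
  hit <- df2$ID == df1$ID[r] &
    df2$Date3 >= df1$Date1[r] &
    df2$Date3 <= df1$Date2[r]   # inclusive upper bound is an assumption
  SumValue[r] <- sum(df2$Value[hit])
}
expected <- cbind(df1, SumValue)
```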

I know how to do this with a loop, but the data frames are huge! Does anyone have an efficient solution? I have been exploring data.table, plyr and dplyr but could not find one.

EdM
  • You can use `foverlaps` from `data.table` – akrun Mar 29 '16 at 12:30
  • Possible duplicate of [Checking if Date is Between two Dates in R](http://stackoverflow.com/questions/31353595/checking-if-date-is-between-two-dates-in-r) – Ronak Shah Mar 29 '16 at 12:31

3 Answers

A couple of data.table solutions that should scale well (and a good stop-gap until non-equi joins are implemented):

Do the comparison in j using by=.EACHI.

library(data.table)
setDT(df1)
setDT(df2)

df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]

df1[  df2,
      {
        idx = Date1 <= i.Date3 & i.Date3 <= Date2
        .(Date1 = Date1[idx],
          Date2 = Date2[idx],
          Date3 = i.Date3,
          Value = i.Value)
      }, 
      on=c("ID"),
      by=.EACHI][, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]

#    ID      Date1      Date2 sumValue
# 1:  1 2016-03-01 2016-04-01        5
# 2:  1 2016-04-01 2016-05-01       14
# 3:  2 2016-03-14 2016-04-15       17
# 4:  2 2016-04-15 2016-05-17        2
# 5:  3 2016-05-01 2016-06-10       14
# 6:  3 2016-06-10 2016-07-15        3

foverlaps join (as suggested in the comments)

library(data.table)
setDT(df1)
setDT(df2)

df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]

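# foverlaps needs an interval in both tables, so treat each Date3 as a zero-length interval
df2[, Date4 := Date3]

# the table being looked up (df1) must be keyed, with its interval columns last
setkey(df1, ID, Date1, Date2)
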
foverlaps(df2,
          df1,
          by.x=c("ID", "Date3", "Date4"),
          type="within")[, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]

#    ID      Date1      Date2 sumValue
# 1:  1 2016-03-01 2016-04-01        5
# 2:  1 2016-04-01 2016-05-01       14
# 3:  2 2016-03-14 2016-04-15       17
# 4:  2 2016-04-15 2016-05-17        2
# 5:  3 2016-05-01 2016-06-10       14
# 6:  3 2016-06-10 2016-07-15        3

Further reading

Rolling join on data.table with duplicate keys

foverlap joins in data.table

SymbolixAU
  • GREAT!!!! THANK YOU SO MUCH Symbolix!!! That is exactly what I needed!!! While the methods proposed before were taking hours, with your solution it took only 28 seconds! I just cannot believe it! – EdM Mar 31 '16 at 02:51
  • @user3692906 - you're welcome. These joins have been invaluable for me too! – SymbolixAU Mar 31 '16 at 02:52
  • @user3692906 - and, an 'up vote' on an answer is always appreciated ;-) – SymbolixAU Mar 31 '16 at 02:55
  • I would love to do so but that is my very first question and I still cannot vote up. My reputation is at 13... but I will do so as soon as it comes to 15. ;-) – EdM Mar 31 '16 at 03:59

With the recently implemented non-equi joins feature in the current development version of data.table, v1.9.7, this can be done as follows (dt1 and dt2 are df1 and df2 as data.tables, with the date columns converted to Date):

dt2[dt1, .(sum = sum(Value)), on=.(ID, Date3>=Date1, Date3<=Date2), by=.EACHI]
#    ID      Date3      Date3 sum
# 1:  1 2016-03-01 2016-04-01   5
# 2:  1 2016-04-01 2016-05-01  14
# 3:  2 2016-03-14 2016-04-15  17
# 4:  2 2016-04-15 2016-05-17   2
# 5:  3 2016-05-01 2016-06-10  14
# 6:  3 2016-06-10 2016-07-15   3

The column names need some fixing: the two columns labelled Date3 actually hold Date1 and Date2 from dt1. Will work on it later.
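
Until then, one possible workaround is to rename the result by position. The following is only a sketch under a few assumptions: it rebuilds the question's sample data as dt1 and dt2, it relies on the join columns coming back in the order ID, lower bound, upper bound (as in the output above), and to my knowledge it needs data.table 1.9.8 or later, the first CRAN release with non-equi joins.

```r
library(data.table)  # non-equi joins: assumed to need data.table >= 1.9.8

# Sample data from the question, built directly as data.tables
dt1 <- data.table(
  ID    = c(1L, 1L, 2L, 2L, 3L, 3L),
  Date1 = as.Date(c("2016-03-01", "2016-04-01", "2016-03-14",
                    "2016-04-15", "2016-05-01", "2016-06-10")),
  Date2 = as.Date(c("2016-04-01", "2016-05-01", "2016-04-15",
                    "2016-05-17", "2016-06-10", "2016-07-15"))
)
dt2 <- data.table(
  ID    = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  Date3 = as.Date(c("2016-03-15", "2016-04-04", "2016-04-28", "2016-03-18",
                    "2016-03-27", "2016-04-08", "2016-04-20", "2016-05-05",
                    "2016-05-25", "2016-06-13")),
  Value = c(5L, 7L, 7L, 3L, 5L, 9L, 2L, 6L, 8L, 3L)
)

# One result row per dt1 row: sum the Values whose Date3 lies in [Date1, Date2]
res <- dt2[dt1,
           .(SumValue = sum(Value)),
           on = .(ID, Date3 >= Date1, Date3 <= Date2),
           by = .EACHI]

# Both join columns come back named Date3, so rename all columns by position
setnames(res, c("ID", "Date1", "Date2", "SumValue"))
```

A dt1 row with no matching dt2 row would get NA here, since Value is NA for unmatched rows; sum(Value, na.rm = TRUE) would turn that into 0 if that is preferred.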

Arun

Here's a base R solution using sapply():

df1 <- data.frame(
    ID=c(1L,1L,2L,2L,3L,3L),
    Date1=as.Date(c('2016-03-01','2016-04-01','2016-03-14','2016-04-15','2016-05-01','2016-06-01')),
    Date2=as.Date(c('2016-04-01','2016-05-01','2016-04-15','2016-05-17','2016-06-15','2016-07-15'))
);
df2 <- data.frame(
    ID=c(1L,1L,1L,2L,2L,2L,2L,3L,3L,3L),
    Date3=as.Date(c('2016-03-15','2016-04-04','2016-04-28','2016-03-18','2016-03-27','2016-04-08','2016-04-20','2016-05-05','2016-05-25','2016-06-13')),
    Value=c(5L,7L,7L,3L,5L,9L,2L,6L,8L,3L)
);
cbind(df1,SumValue=sapply(seq_len(nrow(df1)),function(ri) {
    sum(df2$Value[df1$ID[ri]==df2$ID & df1$Date1[ri]<=df2$Date3 & df1$Date2[ri]>df2$Date3]);
}));
##   ID      Date1      Date2 SumValue
## 1  1 2016-03-01 2016-04-01        5
## 2  1 2016-04-01 2016-05-01       14
## 3  2 2016-03-14 2016-04-15       17
## 4  2 2016-04-15 2016-05-17        2
## 5  3 2016-05-01 2016-06-15       17
## 6  3 2016-06-01 2016-07-15        3

Note that your df1 and expected output have slightly different dates in some cases; I used the df1 dates.
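
In case the `function(ri)` part of the sapply() call is unfamiliar: it is an anonymous function that sapply() calls once per element of `seq_len(nrow(df1))`, with `ri` bound to the current row index. The sketch below spells out the same idea with a named function on a cut-down version of the data; `sum_for_row` and `row_ids` are names made up for illustration. See `?sapply` for the details.

```r
# Cut-down data: two df1 intervals and three df2 rows, all for ID 1
df1 <- data.frame(
  ID    = c(1L, 1L),
  Date1 = as.Date(c("2016-03-01", "2016-04-01")),
  Date2 = as.Date(c("2016-04-01", "2016-05-01"))
)
df2 <- data.frame(
  ID    = c(1L, 1L, 1L),
  Date3 = as.Date(c("2016-03-15", "2016-04-04", "2016-04-28")),
  Value = c(5L, 7L, 7L)
)

# Hypothetical helper: the body of function(ri) given a name.
# ri is a single row index into df1; the result is the sum for that row.
sum_for_row <- function(ri) {
  keep <- df1$ID[ri] == df2$ID &
    df1$Date1[ri] <= df2$Date3 &
    df1$Date2[ri] > df2$Date3
  sum(df2$Value[keep])
}

row_ids  <- seq_len(nrow(df1))            # the row indexes 1, 2
SumValue <- sapply(row_ids, sum_for_row)  # calls sum_for_row(1), then sum_for_row(2)
```

`vapply(row_ids, sum_for_row, integer(1))` does the same thing but also checks that every call returns a single integer.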


Here's another approach that attempts to be more vectorized: Precompute a cartesian product of indexes into the two frames, then perform a single vectorized conditional expression using the index vectors to get matching pairs of indexes, and finally use the matching indexes to aggregate the desired result:

cbind(df1,SumValue=with(expand.grid(i1=seq_len(nrow(df1)),i2=seq_len(nrow(df2))),{
    x <- df1$ID[i1]==df2$ID[i2] & df1$Date1[i1]<=df2$Date3[i2] & df1$Date2[i1]>df2$Date3[i2];
    tapply(df2$Value[i2[x]],i1[x],sum);
}));
##   ID      Date1      Date2 SumValue
## 1  1 2016-03-01 2016-04-01        5
## 2  1 2016-04-01 2016-05-01       14
## 3  2 2016-03-14 2016-04-15       17
## 4  2 2016-04-15 2016-05-17        2
## 5  3 2016-05-01 2016-06-15       17
## 6  3 2016-06-01 2016-07-15        3
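
One caveat with the tapply() step: it only returns groups that have at least one match, so a df1 row with no matching df2 row would make the cbind() misalign or fail. A possible variant, sketched here on a small made-up example where ID 2 has no rows in df2, is to group on a factor whose levels cover every df1 row and then replace the resulting NA with 0. The names `pairs`, `hit` and `groups` are illustrative only.

```r
df1 <- data.frame(
  ID    = c(1L, 1L, 2L),
  Date1 = as.Date(c("2016-03-01", "2016-04-01", "2016-03-14")),
  Date2 = as.Date(c("2016-04-01", "2016-05-01", "2016-04-15"))
)
df2 <- data.frame(
  ID    = c(1L, 1L, 1L),  # note: nothing for ID 2
  Date3 = as.Date(c("2016-03-15", "2016-04-04", "2016-04-28")),
  Value = c(5L, 7L, 7L)
)

# Every (df1 row, df2 row) combination; this is nrow(df1) * nrow(df2) rows
pairs <- expand.grid(i1 = seq_len(nrow(df1)), i2 = seq_len(nrow(df2)))

# Same condition as above, evaluated once over all combinations
hit <- df1$ID[pairs$i1] == df2$ID[pairs$i2] &
  df1$Date1[pairs$i1] <= df2$Date3[pairs$i2] &
  df1$Date2[pairs$i1] > df2$Date3[pairs$i2]

# A factor with one level per df1 row keeps unmatched rows in the result as NA
groups <- factor(pairs$i1[hit], levels = seq_len(nrow(df1)))
sums   <- tapply(df2$Value[pairs$i2[hit]], groups, sum)
sums[is.na(sums)] <- 0L

result <- cbind(df1, SumValue = as.vector(sums))
```

The memory cost is unchanged: the cartesian product still grows with nrow(df1) times nrow(df2), so this remains a small-data approach.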
bgoldst
  • Thanks a lot! Two solutions is more than I expected, and both work on the test data set I made. I'm running into memory problems with the second solution, and I'm still waiting for the first to finish running... more than 1 hour! Just too much data! Do you have some documentation on the use of ri and function(ri)? – EdM Mar 29 '16 at 22:10