2

I have a large panel dataset (10,000,000 x 53) with about 50 columns of scores. I have aggregated each score by group (there are about 15,000) and date.

Now I want to calculate a rolling sum of three values including the prior two dates' and the current date's scores, creating a new corresponding sum column. The sums should be calculated for each score column by date and group. For 1st and 2nd dates within a group, fewer than 3 values is allowed.

   GROUP        DATE      LAGGED  SCORE1     SUM1   SCORE2     SUM2  ... SCORE50   SUM50

#1     A  2017-04-01  2017-03-30       1      1|1        2      2|2            4     4|4
#2     A  2017-04-02  2017-03-31       1    1+1|2        3    3+2|5            3   3+4|7
#3     A  2017-04-04  2017-04-02       2  2+1+1|4        4  4+3+2|9            2 2+3+4|9

#5     B  2017-04-02  2017-03-31       2      2|2        3      3|3            1     1|1
#6     B  2017-04-05  2017-04-03       2    2+2|4        2    2+3|5            1   1+1|2
#7     B  2017-04-08  2017-04-06       3  3+2+2|7        1  1+2+3|6            3 3+1+1|5

#8     C  2017-04-02  2017-03-31       3      3|3        1      1|1            1     1|1
#9     C  2017-04-03  2017-04-01       2    2+3|5        3    3+1|4            2   2+1|3

 :     :           :           :       :        :        :        :            :       :

#10M  XX  2018-03-30  2018-03-28       2        2        1        1    ...     1       1

David's answer from this post covered most of my questions on summing rolling windows by groups but I'm still missing a couple pieces.

    library(data.table) #v1.10.4

    ## Convert to a proper date class, and add another column 
    ## in order to define the range

    setDT(input)[, c("Date", "Date2") := {
      Date = as.IDate(Date)
      Date2 = Date - 2L
      .(Date, Date2)
    }]

    ## Run a non-equi join against the unique Date/Group combination in input
    ## Sum the Scores on the fly
    ## You can ignore the second Date column 

    input[unique(input, by = c("Date", "Group")),   ## This removes the dupes
       on = .(Group, Date <= Date, Date >= Date2),  ## The join condition
       .(Score = sum(Score)),                       ## sum the scores
       keyby = .EACHI]                          ## Run the sum by each row in
                                                ## unique(input, by = c("Date", "Group"))

My question has two parts:

  1. What code should replace "Score" to calculate time window sums for each column in a range of columns?
  2. Is the solution provided the most efficient version for fast calculation on large dataset?
Jaap
  • 81,064
  • 34
  • 182
  • 193
em1031
  • 35
  • 5

1 Answers1

3

A possible solution:

cols <- grep('^SCORE', names(input), value = TRUE)

input[, gsub('SCORE','SUM',cols) := lapply(.SD, cumsum)
      , by = GROUP
      , .SDcols = cols][]

which gives:

   GROUP       DATE     LAGGED SCORE1 SCORE2 SUM1 SUM2
1:     A 2017-04-01 2017-03-30      1      2    1    2
2:     A 2017-04-02 2017-03-31      1      3    2    5
3:     A 2017-04-04 2017-04-02      2      4    4    9
4:     B 2017-04-02 2017-03-31      2      3    2    3
5:     B 2017-04-05 2017-04-03      2      2    4    5
6:     B 2017-04-08 2017-04-06      3      1    7    6
7:     C 2017-04-02 2017-03-31      3      1    3    1
8:     C 2017-04-03 2017-04-01      2      3    5    4

When you want to take a time window into account as well, you could do (assuming LAGGED is the start of the time-window):

input[input[input[, .(GROUP, DATE, LAGGED)]
            , on = .(GROUP, DATE >= LAGGED, DATE <= DATE)
            ][, setNames(lapply(.SD, sum), gsub('SCORE','SUM',cols))
              , by = .(GROUP, DATE = DATE.1)
              , .SDcols = cols]
      , on = .(GROUP, DATE)]

which gives:

   GROUP       DATE     LAGGED SCORE1 SCORE2 SUM1 SUM2
1:     A 2017-04-01 2017-03-30      1      2    1    2
2:     A 2017-04-02 2017-03-31      1      3    2    5
3:     A 2017-04-04 2017-04-02      2      4    3    7
4:     B 2017-04-02 2017-03-31      2      3    2    3
5:     B 2017-04-05 2017-04-03      2      2    2    2
6:     B 2017-04-08 2017-04-06      3      1    3    1
7:     C 2017-04-02 2017-03-31      3      1    3    1
8:     C 2017-04-03 2017-04-01      2      3    5    4

Used data:

input <- fread('   GROUP        DATE      LAGGED  SCORE1     SCORE2     
     A  2017-04-01  2017-03-30       1        2  
     A  2017-04-02  2017-03-31       1        3
     A  2017-04-04  2017-04-02       2        4
     B  2017-04-02  2017-03-31       2        3 
     B  2017-04-05  2017-04-03       2        2
     B  2017-04-08  2017-04-06       3        1
     C  2017-04-02  2017-03-31       3        1
     C  2017-04-03  2017-04-01       2        3')
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • thanks! this helps with my first question for applying to multiple columns. looking for a specific limited time window to stay consistent, is it possible to limit cumsum to a period window? replacing with rollapply from zoo only works with subsets, it's far too slow for the size of my data. – em1031 Mar 11 '18 at 15:00
  • @em1031 see the update; is that what you are looking for? – Jaap Mar 11 '18 at 16:32