I have a large panel dataset (10,000,000 x 53) with about 50 columns of scores. I have aggregated each score by group (there are about 15,000) and date.
Now I want to calculate a rolling sum of three values including the prior two dates' and the current date's scores, creating a new corresponding sum column. The sums should be calculated for each score column by date and group. For 1st and 2nd dates within a group, fewer than 3 values is allowed.
GROUP DATE LAGGED SCORE1 SUM1 SCORE2 SUM2 ... SCORE50 SUM50
#1 A 2017-04-01 2017-03-30 1 1|1 2 2|2 4 4|4
#2 A 2017-04-02 2017-03-31 1 1+1|2 3 3+2|5 3 3+4|7
#3 A 2017-04-04 2017-04-02 2 2+1+1|4 4 4+3+2|9 2 2+3+4|9
#5 B 2017-04-02 2017-03-31 2 2|2 3 3|3 1 1|1
#6 B 2017-04-05 2017-04-03 2 2+2|4 2 2+3|5 1 1+1|2
#7 B 2017-04-08 2017-04-06 3 3+2+2|7 1 1+2+3|6 3 3+1+1|5
#8 C 2017-04-02 2017-03-31 3 3|3 1 1|1 1 1|1
#9 C 2017-04-03 2017-04-01 2 2+3|5 3 3+1|4 2 2+1|3
: : : : : : : : : :
#10M XX 2018-03-30 2018-03-28 2 2 1 1 ... 1 1
David's answer from this post covered most of my questions on summing rolling windows by groups but I'm still missing a couple pieces.
library(data.table) #v1.10.4
## Convert to a proper date class, and add another column
## in order to define the range
setDT(input)[, c("Date", "Date2") := {
Date = as.IDate(Date)
Date2 = Date - 2L
.(Date, Date2)
}]
## Run a non-equi join against the unique Date/Group combination in input
## Sum the Scores on the fly
## You can ignore the second Date column
input[unique(input, by = c("Date", "Group")), ## This removes the dupes
on = .(Group, Date <= Date, Date >= Date2), ## The join condition
.(Score = sum(Score)), ## sum the scores
keyby = .EACHI] ## Run the sum by each row in
## unique(input, by = c("Date", "Group"))
My question has two parts:
- What code should replace "Score" to calculate time window sums for each column in a range of columns?
- Is the solution provided the most efficient version for fast calculation on large dataset?