My question is inspired by Cumulative sum in a window (or running window sum) based on a condition in R.
I want to calculate running window sum just as in above post with a little twist. I'd want to "carry forward" the value of cumulative sum until "k" years even if there wouldn't be a row that satisfies filter criterion. In other words, we will need to add rows to the original data set.
This problem is challenging because I am still not comfortable with using apply
function within data.table
.
Here's my input data:
DFI <- structure(list(Year = c(2011, 2013, 2014, 2010, 2012, 2015),
Customer = c(13575, 13575, 13575, 13575, 13576, 13576), Product = c("R",
"R", "R", "W", "S", "R"), Rev = c(4, 1, 2, 1, 2, 2)), .Names = c("Year",
"Customer", "Product", "Rev"), row.names = c(NA, -6L), class = "data.frame")
Here's my expected output:
DFO <- structure(list(Year = c(2011, 2012, 2013, 2014, 2015, 2010, 2011,
2015, 2012, 2013), Customer = c(13575, 13575, 13575, 13575, 13575,
13575, 13575, 13576, 13576, 13576), Product = c("R", "R", "R",
"R", "R", "W", "W", "R", "S", "S"), Rev = c(4, 0, 1, 2, 0, 1,
0, 2, 2, 0), CumRev = c(4, 4, 1, 3, 2, 1, 1, 2, 2, 2)), .Names = c("Year",
"Customer", "Product", "Rev", "CumRev"), class = "data.frame", row.names = c(NA,
-10L))
Some commentary about how I manually generated DFO
:
a) Number of years in the window = 2 i.e. k=2
b) Although the entry for Year = 2012
, Customer = 13575
and Product = R
doesn't exist in DFI
(input data), it was added because the cumulative sum from Year = 2011
would be carried forward for one more year (i.e. k-1 = 2-1 = 1
). Hence, for this row, Rev = 0
and CumRev = 4
c) The entry for Year = 2015
, Customer = 13575
and Product = R
was added because there exists at least one entry in the table for Year = 2015
. In other words, the range of Year
to be added (or to be carried forward) would depend on two things: 1) The range of Year
in input table 2) Length of running window.
Now, I did try to solve this problem myself before posting. I have spent close to 36 hours, and I was able to solve this. However, the issue is with expand.grid.
In actual data, I run out of memory. Hence, I am wondering whether there is any better method (computationally less expensive and memory efficient) to solve this.
Here's my code:
Year<-unique(DFI$Year)
Customer<-unique(DFI$Customer)
Product<-unique(DFI$Product)
DFO1<-expand.grid(Year = Year,Customer = Customer,Product = Product) #generate all combinations
DFO1<-data.table::as.data.table(DFO1)
#Do join between DFO and DFI to add Rev
DFO1<-DFI[DFO1,on=c("Product","Customer","Year")]
k<-2 #Number of years = 2
DFO1<-DFO1[order(Customer,Product,Year)]
DFO1[is.na(Rev)]$Rev<-0
DFO1<-DFO1[, CumRev := sapply(Year, function(year) sum(Rev[between(Year, year-k+1, year)])), by = .(Customer, Product)][order(Customer,Product,Year)]
DFO1<-DFO1[CumRev!=0] #Remove zero rows
DFO<-data.table::as.data.table(DFO)
DFO<-DFO[order(Customer,Product,Year)]
compare(DFO1,DFO) #TRUE
As someone who is just starting to learn to apply apply()
in data.table
, this was tough for me. I'd appreciate any thoughts to optimize this. I am willing to learn from this process. Thank you for your time and for any help extended to me.