1

My question is inspired by Cumulative sum in a window (or running window sum) based on a condition in R.

I want to calculate running window sum just as in above post with a little twist. I'd want to "carry forward" the value of cumulative sum until "k" years even if there wouldn't be a row that satisfies filter criterion. In other words, we will need to add rows to the original data set.

This problem is challenging because I am still not comfortable with using apply function within data.table.

Here's my input data:

DFI <- structure(list(Year = c(2011, 2013, 2014, 2010, 2012, 2015), 
    Customer = c(13575, 13575, 13575, 13575, 13576, 13576), Product = c("R", 
    "R", "R", "W", "S", "R"), Rev = c(4, 1, 2, 1, 2, 2)), .Names = c("Year", 
"Customer", "Product", "Rev"), row.names = c(NA, -6L), class = "data.frame")

Here's my expected output:

DFO <- structure(list(Year = c(2011, 2012, 2013, 2014, 2015, 2010, 2011, 
2015, 2012, 2013), Customer = c(13575, 13575, 13575, 13575, 13575, 
13575, 13575, 13576, 13576, 13576), Product = c("R", "R", "R", 
"R", "R", "W", "W", "R", "S", "S"), Rev = c(4, 0, 1, 2, 0, 1, 
0, 2, 2, 0), CumRev = c(4, 4, 1, 3, 2, 1, 1, 2, 2, 2)), .Names = c("Year", 
"Customer", "Product", "Rev", "CumRev"), class = "data.frame", row.names = c(NA, 
-10L))

Some commentary about how I manually generated DFO:

a) Number of years in the window = 2 i.e. k=2

b) Although the entry for Year = 2012, Customer = 13575 and Product = R doesn't exist in DFI (input data), it was added because the cumulative sum from Year = 2011 would be carried forward for one more year (i.e. k-1 = 2-1 = 1). Hence, for this row, Rev = 0 and CumRev = 4

c) The entry for Year = 2015, Customer = 13575 and Product = R was added because there exists at least one entry in the table for Year = 2015. In other words, the range of Year to be added (or to be carried forward) would depend on two things: 1) The range of Year in input table 2) Length of running window.

Now, I did try to solve this problem myself before posting. I have spent close to 36 hours, and I was able to solve this. However, the issue is with expand.grid. In actual data, I run out of memory. Hence, I am wondering whether there is any better method (computationally less expensive and memory efficient) to solve this.

Here's my code:

Year<-unique(DFI$Year)
  Customer<-unique(DFI$Customer)
  Product<-unique(DFI$Product)
  DFO1<-expand.grid(Year = Year,Customer = Customer,Product = Product) #generate all combinations
  DFO1<-data.table::as.data.table(DFO1)

  #Do join between DFO and DFI to add Rev
  DFO1<-DFI[DFO1,on=c("Product","Customer","Year")]

   k<-2 #Number of years  = 2
   DFO1<-DFO1[order(Customer,Product,Year)]
   DFO1[is.na(Rev)]$Rev<-0

   DFO1<-DFO1[, CumRev := sapply(Year, function(year) sum(Rev[between(Year, year-k+1, year)])), by = .(Customer, Product)][order(Customer,Product,Year)]
   DFO1<-DFO1[CumRev!=0] #Remove zero rows

   DFO<-data.table::as.data.table(DFO)
   DFO<-DFO[order(Customer,Product,Year)]
   compare(DFO1,DFO) #TRUE

As someone who is just starting to learn to apply apply() in data.table, this was tough for me. I'd appreciate any thoughts to optimize this. I am willing to learn from this process. Thank you for your time and for any help extended to me.

watchtower
  • 4,140
  • 14
  • 50
  • 92
  • In the `DFO`, the full combinations are not there? – akrun Feb 15 '18 at 03:25
  • @akrun - Yes. I removed them (using `DFO1<-DFO1[CumRev!=0])` because we would only need to carry forward the cumulative sum for `k-1` years. – watchtower Feb 15 '18 at 03:27
  • why is there no entry for Year=2016, Customer=13576, Product=R? – chinsoon12 Feb 15 '18 at 03:38
  • @Chinsoon: Thanks for asking. There are two criteria to determine which rows will be added `max(year)` and `k-1`. Please see point c) above. If we had any one entry for, say, any other customer for year `2016`, we would have required to have it. Because there is no entry for `year = 2016`, we get a pass. In simple words, what it means is that `current year = 2015` – watchtower Feb 15 '18 at 03:41

1 Answers1

1

Explanation inline. Uses @G.Grothendieck's Sum function and also his application of his zoo::rollapplyr from Cumulative sum in a window (or running window sum) based on a condition in R

k <- 2
Sum <- function(x) {
    x <- matrix(x,, 2)
    FY <- x[, 1]
    Rev <- x[, 2]
    ok <- FY >= tail(FY, 1) - k + 1
    sum(Rev[ok])
}    


setDT(DFI)
#This is prob the only difference from your solution
#create a combination of year to year + k for each Customer and product.
#Then subset to remove future years
combis <- unique(rbindlist(lapply(seq_len(k), 
    function(n) unique(DFI[, .(Year=Year+n-1, Customer, Product)]))))[
        Year <= DFI[,max(Year)]]

#lookup revenue
out <- DFI[combis, on=.(Year, Customer, Product)][,
    Rev := ifelse(is.na(Rev), 0, Rev)]

#order before summing
setorder(out, Customer,Product,Year)
out[,CumRev := zoo::rollapplyr(.SD, k, Sum, by.column = FALSE, partial = TRUE),
    by = c("Customer", "Product"), .SDcols = c("Year", "Rev")][]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • thanks for your help. Could you please explain `unique(DFI[, .(Year=Year+n-1, Customer, Product)]))))`? – watchtower Feb 15 '18 at 04:09
  • 1
    It just takes your data set of product and customer and add a few years. For example if your window is 2 with a customer A, product P and year Y, it adds customer A, Product P and year Y+1 to the final universe that you want – chinsoon12 Feb 15 '18 at 05:53