
Somewhat related to this question and this one, I'm having trouble calculating a rolling sum. Unlike those questions, I would like to try to use `zoo::rollsum`, analogous to the rollapply answer here. (But if there is a more data.table way to do it, by all means.)

Let's start with some data:

library(data.table)
library(zoo)

set.seed(123)
some_dates <- function(){as.Date('1980-01-01') + sort(sample.int(1e4,100))}
d <- data.table(cust_id = c(rep(123,100),rep(456,100)),
                purch_dt = c(some_dates(), some_dates()),
                purch_amt = round(runif(200, 1, 100),2) )

head(d)
#    cust_id   purch_dt purch_amt
# 1:     123 1980-01-08     24.63
# 2:     123 1980-09-03     96.27
# 3:     123 1981-02-24     60.54

I would like to do a rolling 365-day sum of purchase amount for each customer, calculated at each transaction day.

The answer here suggests the following approach:

First, create dummy rows for all customer-date pairs, using cross join, i.e. something like:

setkey(d, cust_id, purch_dt)
dummy <- d[ CJ(unique(cust_id), seq(min(purch_dt), max(purch_dt), by='day') ) ]
#    cust_id   purch_dt purch_amt
# 1:     123 1980-01-08     24.63
# 2:     123 1980-01-09        NA
# 3:     123 1980-01-10        NA

So far, so good (although I'm sure there's a way to tighten this dummy table to the customer-level min/max purch_dt).
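One way to tighten it, I think, is to build the date sequence per customer instead of one global `CJ` (a sketch with my own throwaway names `per_cust` / `dummy2`; it rebuilds `d` from above so it runs standalone):

```r
library(data.table)

# rebuild d from above so this chunk runs standalone
set.seed(123)
some_dates <- function() as.Date("1980-01-01") + sort(sample.int(1e4, 100))
d <- data.table(cust_id   = c(rep(123, 100), rep(456, 100)),
                purch_dt  = c(some_dates(), some_dates()),
                purch_amt = round(runif(200, 1, 100), 2))

# one row per customer-day, limited to each customer's own first/last purchase
per_cust <- d[, .(purch_dt = seq(min(purch_dt), max(purch_dt), by = "day")),
              by = cust_id]
setkey(d, cust_id, purch_dt)
dummy2 <- d[per_cust]   # purch_amt is NA on non-purchase days, as with CJ
```

That avoids padding each customer out to the global date range, but it doesn't change the rollsumr problem below.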

My problem is how to use rollsumr to calculate a trailing 365-day sum.

I tried:

dummy[, purch_365 := rollsumr(x=purch_amt, k=365, na.rm=TRUE) , by=cust_id]

But this creates purch_365 as all NAs and gives two warnings like:

Warning messages:
1: In `[.data.table`(dummy, , `:=`(purch_365, rollsumr(x = purch_amt,  :
  Supplied 9550 items to be assigned to group 1 of size 9914 in column 'purch_365' (recycled leaving remainder of 364 items).

I get that 364 = k - 1, and that there are two warnings for the two values of cust_id. Other than that, I'm at a loss.
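Debugging on a plain vector makes the symptoms easier to see (a minimal sketch, with `x` as a toy vector; as far as I can tell, `rollsumr` has no `na.rm` argument, so the one I passed is silently swallowed):

```r
library(zoo)

x <- c(1, NA, NA, 2)

length(rollsumr(x, k = 2))             # 3 = length(x) - k + 1, hence the recycling
length(rollsumr(x, k = 2, fill = NA))  # 4, but any window touching an NA is still NA

# rollapplyr() with an explicit sum() does honor na.rm, at some speed cost
rollapplyr(x, 2, sum, na.rm = TRUE, fill = NA)   # NA  1  0  2
```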

# Desired output:
#    cust_id   purch_dt purch_amt purch_365
# 1:     123 1980-01-08     24.63     24.63
# 2:     123 1980-09-03     96.27    120.90
# 3:     123 1981-02-24     60.54    156.81

Thanks in advance!

C8H10N4O2
  • Did you try `fill = NA`? It is mentioned in `?rollsum`. You should debug this on a vector instead of a data.table so it would be easier to catch. – jangorecki Sep 16 '15 at 21:37
  • @jangorecki thanks, but I tried `fill = NA`; interestingly, it stopped the warning but resulted in the same column of `NA`s – C8H10N4O2 Sep 17 '15 at 19:09

2 Answers


Here's one way. First, add a column with the last date you care about, and an index to keep track of things:

d[, old.date := purch_dt - 365]
d[, idx := .I]

Then do a rolling join (assumes data.table version 1.9.5+) on that date, and extract the range of indices for each of the matches (i.e. by `.EACHI`):

res = d[d, .(idx = i.idx, seq = idx:i.idx), by = .EACHI, roll = -Inf,
        on = c(cust_id = 'cust_id', purch_dt = 'old.date')]

Finally, subset the original data.table with the appropriate ranges, and compute the sums:

d[, purch_365 := d[res$seq, sum(purch_amt), by = res$idx]$V1][]
#     cust_id   purch_dt purch_amt idx   old.date purch_365
#  1:     123 1980-01-08     24.63   1 1979-01-08     24.63
#  2:     123 1980-09-03     96.27   2 1979-09-04    120.90
#  3:     123 1981-02-24     60.54   3 1980-02-25    156.81
#  4:     123 1981-04-01     51.99   4 1980-04-01    208.80
#  5:     123 1981-04-02     40.85   5 1980-04-02    249.65
# ---                                                      
#196:     456 2006-01-29     24.72 196 2005-01-29    187.81
#197:     456 2006-02-15     27.78 197 2005-02-15    215.59
#198:     456 2006-09-22     11.00 198 2005-09-22     74.94
#199:     456 2006-09-27     12.67 199 2005-09-27     87.61
#200:     456 2006-11-18     99.13 200 2005-11-18    186.74
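(A sanity check of my own, not part of the original answer: the join-based result matches a naive O(n²) per-row computation of the inclusive trailing window, with `naive` as a throwaway name. Exact amounts may differ on R ≥ 3.6.0, where `set.seed`'s default sampling changed, but the two methods still agree.)

```r
library(data.table)

# rebuild the data and the solution above so this chunk runs standalone
set.seed(123)
some_dates <- function() as.Date("1980-01-01") + sort(sample.int(1e4, 100))
d <- data.table(cust_id   = c(rep(123, 100), rep(456, 100)),
                purch_dt  = c(some_dates(), some_dates()),
                purch_amt = round(runif(200, 1, 100), 2))
setkey(d, cust_id, purch_dt)

d[, old.date := purch_dt - 365]
d[, idx := .I]
res <- d[d, .(idx = i.idx, seq = idx:i.idx), by = .EACHI, roll = -Inf,
        on = c(cust_id = "cust_id", purch_dt = "old.date")]
d[, purch_365 := d[res$seq, sum(purch_amt), by = res$idx]$V1]

# naive O(n^2) per-row check: inclusive trailing window [purch_dt - 365, purch_dt]
naive <- d[, .(chk = sapply(seq_len(.N), function(i)
           sum(purch_amt[purch_dt >= purch_dt[i] - 365 & purch_dt <= purch_dt[i]]))),
           by = cust_id]$chk
all.equal(naive, d$purch_365)   # TRUE
```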
eddi
  • This works, thanks. It seems like this self-join should be faster than the creation of the larger dummy table with cross-join, although I couldn't get the `zoo::rollsum` approach to work, so I can't benchmark. Just curious -- you've answered a few of my `data.table` questions, and I'm curious how you've already mastered the dev version of data.table -- are you involved in the development, or how have you learned all this? I would like to do the same... – C8H10N4O2 Sep 18 '15 at 13:26
  • I am somewhat involved, though much less than a year or two ago. I think the only 'dev' part in the above is the `on` argument. If you want to always be on the cutting edge of features, I'd recommend simply subscribing to the github channel and reading the various updates/suggestions/discussions/etc. – eddi Sep 18 '15 at 15:22
  • Great solution @eddi! Any chance you could tweak it so that the purchases made on the focal date (purch_dt) are excluded? Simple way would be to add something like `d[, benchmark.date := purch_dt - 1]` and then use that date. I was wondering if something more elegant is possible? – SJDS Dec 07 '15 at 13:39
  • So I tried my own simple idea and that does not seem to work as the `seq` column remains increasing even when the dates are the same. Any suggestion? – SJDS Dec 07 '15 at 13:48
  • @SJDS wouldn't changing to `seq = (idx+1):i.idx` above do it? – eddi Dec 09 '15 at 16:27
  • @eddi, I previously managed to run a tweak of your code to get the results I needed, but when I tried to run it again I got a size-based error (`cannot allocate vector of size 3.1 Gb`) and a warning message: `In idx:i.idx : numerical expression has 2 elements: only the first used`, even though the size of my data has actually shrunk. Any idea if something changed in R recently? – SJDS Jan 18 '16 at 05:10
  • @SJDS not sure about the size - maybe you ran out of memory? As for the warning - that sounds like an issue with the algorithm and I would investigate it further. – eddi Jan 19 '16 at 17:48
  • @eddi Really nice solution, I learned a lot! Nevertheless, I couldn't understand the rationale behind the `by = res$idx`, since I expected `by =` to receive column references, not a list of values. I searched a lot about it but couldn't find a reference. Do you know any place where I can learn about this feature? – Jonatas Eduardo Feb 03 '17 at 20:48
  • @JonatasEduardo I'd start with the [github vignettes](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html). `by` can receive column names, but more generally it just receives a list of values, and groups by those. – eddi Feb 03 '17 at 21:55

This is a total shot in the dark because I'm not completely following your question, but perhaps turning the `NA` values in `purch_amt` into numeric `0`s might solve the problem? It's possible R is adding together `NA` values, which would result in the answer being `NA`.

For example, `NA + 1` yields `NA`. Try changing those `NA` values to zero.
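A quick base-R illustration of the point:

```r
NA + 1                        # NA
sum(c(NA, 1))                 # NA: sum() propagates NA by default
sum(c(NA, 1), na.rm = TRUE)   # 1
```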

giraffehere
  • Thanks, I tried tacking `[is.na(purch_amt), purch_amt := 0]` on to the end of the dummy table -- no help – C8H10N4O2 Sep 17 '15 at 19:21