7

I have a data.table with date, zipcode and purchase amounts.

library(data.table)
set.seed(88)
DT <- data.table(date = Sys.Date()-365 + sort(sample(1:100, 10)), 
zip = sample(c("2000", "1150", "3000"),10, replace = TRUE), 
purchaseAmount = sample(1:20, 10))  

This creates the following:

    date       zip              purchaseAmount
 1: 2016-01-08 1150              5
 2: 2016-01-15 3000             15
 3: 2016-02-15 1150             16
 4: 2016-02-20 2000             18
 5: 2016-03-07 2000             19
 6: 2016-03-15 2000             11
 7: 2016-03-17 2000              6
 8: 2016-04-02 1150             17
 9: 2016-04-08 3000              7
10: 2016-04-09 3000             20

I would like to add a fourth column earlierPurchases. This column should sum all the values in purchaseAmount for the previous x date within the zipcode.

EDIT: As per suggestion from Frank, here is the expected output:

          date  zip purchaseAmount new_col
 1: 2016-01-08 1150              5       5
 2: 2016-01-15 3000             15      15
 3: 2016-02-15 1150             16      16
 4: 2016-02-20 2000             18      18
 5: 2016-03-07 2000             19      19
 6: 2016-03-15 2000             11      30
 7: 2016-03-17 2000              6      36
 8: 2016-04-02 1150             17      17
 9: 2016-04-08 3000              7       7
10: 2016-04-09 3000             20      27

Is there a data.table way to do this, or should I just write a looping function?

Mantelimies
  • 73
  • 1
  • 6
  • 2
    Re your comment below, this is why the recommendation is to post both your input and expected output, so it is clear if an answer achieves the desired result simply from reading the question. Here's some guidance: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 It's not too late for you to edit it in now. – Frank Jan 03 '17 at 21:31
  • 1
    Thank you for the guidance and for editing the code @Frank . I edited the output to the question and also read the vignettes so I would better understand your solution. As this was my first question on the forums I really appreciate the polite and helpful way you provided the answers and suggestions. – Mantelimies Jan 04 '17 at 17:20

2 Answers2

11

This seems to work:

DT[, new_col := 
  DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1), 
    sum(purchaseAmount)
  , by=.EACHI ]$V1
]


          date  zip purchaseAmount new_col
 1: 2016-01-08 1150              5       5
 2: 2016-01-15 3000             15      15
 3: 2016-02-15 1150             16      16
 4: 2016-02-20 2000             18      18
 5: 2016-03-07 2000             19      19
 6: 2016-03-15 2000             11      30
 7: 2016-03-17 2000              6      36
 8: 2016-04-02 1150             17      17
 9: 2016-04-08 3000              7       7
10: 2016-04-09 3000             20      27

This uses a "non-equi" join, effectively taking each row; finding all rows that meet our criteria in the on= expression for each row; and then summing by row (by=.EACHI). In this case, a non-equi join is probably less efficient than some rolling-sum approach.


How it works.

To add columns to a data.table, the usual syntax is DT[, new_col := expression]. Here, the expression actually works even outside of the DT[...]. Try running it on its own:

DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1), 
  sum(purchaseAmount)
, by=.EACHI ]$V1

You can progressively simplify this until it's just the join...

DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1), 
  sum(purchaseAmount)
, by=.EACHI ]
# note that V1 is the default name for computed columns

DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1)]
# now we're down to just the join

The join syntax is like x[i, on=.(xcol = icol, xcol2 < icol2)], as documented in the doc page that opens when you type ?data.table into an R console with the data.table package loaded.

To get started with data.table, I'd suggest reviewing the vignettes. After that, this'll probably look a lot more legible.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • Thanks for the answer. However, the result is not what I am looking for. The value in `new_col` on row 6 should be 30 (11+19) and on row 5 it should just be 19. This also holds for the first 4 rows as there are no other purchases on that `zip` within the 10 previous days. Unfortunately I do not immediately understand what you did with your code as I am a beginner and therefore I am unable to provide a solution based on modifying what you wrote. – Mantelimies Jan 03 '17 at 21:13
  • Ok, thanks for the extra info. I've edited and now find that it matches those two rows, too (though, unfortunately, the code is now more complicated). – Frank Jan 03 '17 at 21:48
-1

I didn't find any data.table solutions, this is how I got it though:

library(dplyr)
earlierPurchases <- vector()

for(i in 1:nrow(DT)) {
  temp <- dplyr::filter(DT, zip == zip[i] & date < date[i])
  earlierPurchases[i] <- sum(temp$purchaseAmount)
}

DT <- cbind(DT, earlierPurchases)

It worked quite fast.

Derek Corcoran
  • 3,930
  • 2
  • 25
  • 54