This is a continuation for a question I posted here Creating a new r data.table column based on values in another column and grouping, and to which @Frank provided an excellent answer.
As I have to do multiple of these calculations with different date intervals, I want to do a function which does them. However, I seem to be running into a scoping problem. I read the Vignettes, FAQ, and a ton of questions here and I still am left baffled.
We'll use the same data:
library(data.table)
set.seed(88)
DT <- data.table(date = Sys.Date()-365 + sort(sample(1:100, 10)),
zip = sample(c("2000", "1150", "3000"),10, replace = TRUE),
purchaseAmount = sample(1:20, 10))
Here is the answer @Frank provided:
DT[, new_col :=
DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1),
sum(purchaseAmount)
, by=.EACHI ]$V1
]
DT
date zip purchaseAmount new_col
1: 2016-01-08 1150 5 5
2: 2016-01-15 3000 15 15
3: 2016-02-15 1150 16 16
4: 2016-02-20 2000 18 18
5: 2016-03-07 2000 19 19
6: 2016-03-15 2000 11 30
7: 2016-03-17 2000 6 36
8: 2016-04-02 1150 17 17
9: 2016-04-08 3000 7 7
10: 2016-04-09 3000 20 27
And now the actual problem I have encountered. I created the following function which enables dynamically changing the interval:
sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
zip = substitute(zip)
newColName = substitute(newColName)
dt[, newColName :=
dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
, on=.(zip, date >= d0, date <= d1),
sum(purchaseAmount)
, by=.EACHI ]$V1
]
}
sumPreviousPurchases(DT, prevPurch1to10, 0, 10)
DT
date zip purchaseAmount newColName
1: 2016-02-07 1150 5 5
2: 2016-02-14 3000 15 15
3: 2016-03-16 1150 16 16
4: 2016-03-21 2000 18 18
5: 2016-04-06 2000 19 19
6: 2016-04-14 2000 11 30
7: 2016-04-16 2000 6 36
8: 2016-05-02 1150 17 17
9: 2016-05-08 3000 7 7
10: 2016-05-09 3000 20 27
What troubles me is the scoping. The function names the new column newColName
regardless of what I insert in the function call. From reading I got that when calling for data.table column names in function arguments, one should use the substitute()
-function. However, this does not work here, the result is the same even if I leave the whole newColName = substitute(newColName)
line out. I suppose it is because the column does not exist yet, but I do not know how to address this issue.
As a bonus I would like to ask, is there also a way to name the columns dynamically, ie. in the example for instance to be "daysFrom
_ to_daysUntil
", and the name would be "0_to_10"?
----- EDIT ----
I also stumbled upon a possible answer myself, somewhat similarly to @lmo's answer using an idea from here: http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/#assign-a-column-with--named-with-a-character-object
Most important differences on the question: I removed the newColName = substitute(newColName)
entirely, and added brackets around the (newColName)
on dt[, (newColName) :=
sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
zip = substitute(zip)
#newColName = substitute(newColName)
dt[, (newColName) :=
dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
, on=.(zip, date >= d0, date <= d1),
sum(purchaseAmount)
, by=.EACHI ]$V1
]
}
Additionally I added quotes to the "prevPurch1to10"
.
sumPreviousPurchases(DT, "prevPurch1to10", 0, 10)
and got the answer
date zip purchaseAmount prevPurch1to10
1: 2016-02-17 1150 7 7
2: 2016-02-22 3000 8 8
3: 2016-03-04 1150 2 2
4: 2016-03-16 2000 14 14
5: 2016-04-03 2000 11 11
6: 2016-04-11 3000 12 12
7: 2016-04-21 1150 17 17
8: 2016-04-22 3000 3 3
9: 2016-05-03 2000 9 9
10: 2016-05-11 3000 4 4
However, there are still the two following weird things:
a) substitute()
is not needed when adding the brackets on (newColName)
. Why is that?
b) quotes are required around the "prevPurch1to10"
. Again, why? Is there a more data.table
ish way to do this, without the quotes?