0

This is a continuation for a question I posted here Creating a new r data.table column based on values in another column and grouping, and to which @Frank provided an excellent answer.

As I have to do multiple of these calculations with different date intervals, I want to do a function which does them. However, I seem to be running into a scoping problem. I read the Vignettes, FAQ, and a ton of questions here and I still am left baffled.

We'll use the same data:

library(data.table)
set.seed(88)
DT <- data.table(date = Sys.Date()-365 + sort(sample(1:100, 10)), 
zip = sample(c("2000", "1150", "3000"),10, replace = TRUE), 
purchaseAmount = sample(1:20, 10))

Here is the answer @Frank provided:

DT[, new_col := 
  DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1), 
    sum(purchaseAmount)
  , by=.EACHI ]$V1
]


DT 

              date  zip purchaseAmount new_col
     1: 2016-01-08 1150              5       5
     2: 2016-01-15 3000             15      15
     3: 2016-02-15 1150             16      16
     4: 2016-02-20 2000             18      18
     5: 2016-03-07 2000             19      19
     6: 2016-03-15 2000             11      30
     7: 2016-03-17 2000              6      36
     8: 2016-04-02 1150             17      17
     9: 2016-04-08 3000              7       7
    10: 2016-04-09 3000             20      27

And now the actual problem I have encountered. I created the following function which enables dynamically changing the interval:

sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
    zip = substitute(zip)
    newColName = substitute(newColName)
    dt[, newColName := 
        dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
        , on=.(zip, date >= d0, date <= d1), 
        sum(purchaseAmount)
        , by=.EACHI ]$V1
    ]
}

sumPreviousPurchases(DT, prevPurch1to10, 0, 10)

DT

          date  zip purchaseAmount newColName
 1: 2016-02-07 1150              5          5
 2: 2016-02-14 3000             15         15
 3: 2016-03-16 1150             16         16
 4: 2016-03-21 2000             18         18
 5: 2016-04-06 2000             19         19
 6: 2016-04-14 2000             11         30
 7: 2016-04-16 2000              6         36
 8: 2016-05-02 1150             17         17
 9: 2016-05-08 3000              7          7
10: 2016-05-09 3000             20         27

What troubles me is the scoping. The function names the new column newColName regardless of what I insert in the function call. From reading I got that when calling for data.table column names in function arguments, one should use the substitute()-function. However, this does not work here, the result is the same even if I leave the whole newColName = substitute(newColName) line out. I suppose it is because the column does not exist yet, but I do not know how to address this issue.

As a bonus I would like to ask, is there also a way to name the columns dynamically, ie. in the example for instance to be "daysFrom_ to_daysUntil", and the name would be "0_to_10"?

----- EDIT ----

I also stumbled upon a possible answer myself, somewhat similarly to @lmo's answer using an idea from here: http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/#assign-a-column-with--named-with-a-character-object

Most important differences on the question: I removed the newColName = substitute(newColName) entirely, and added brackets around the (newColName) on dt[, (newColName) :=

sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
    zip = substitute(zip)
    #newColName = substitute(newColName)
    dt[, (newColName) := 
        dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
        , on=.(zip, date >= d0, date <= d1), 
        sum(purchaseAmount)
        , by=.EACHI ]$V1
    ]
}

Additionally I added quotes to the "prevPurch1to10".

sumPreviousPurchases(DT, "prevPurch1to10", 0, 10)

and got the answer

          date  zip purchaseAmount prevPurch1to10
 1: 2016-02-17 1150              7              7
 2: 2016-02-22 3000              8              8
 3: 2016-03-04 1150              2              2
 4: 2016-03-16 2000             14             14
 5: 2016-04-03 2000             11             11
 6: 2016-04-11 3000             12             12
 7: 2016-04-21 1150             17             17
 8: 2016-04-22 3000              3              3
 9: 2016-05-03 2000              9              9
10: 2016-05-11 3000              4              4

However, there are still the two following weird things:

a) substitute() is not needed when adding the brackets on (newColName). Why is that?

b) quotes are required around the "prevPurch1to10". Again, why? Is there a more data.tableish way to do this, without the quotes?

Community
  • 1
  • 1
Mantelimies
  • 73
  • 1
  • 6

1 Answers1

1

You can use substitute directly in the assignment:

sumPreviousPurchases = function(dt, newColName, daysFrom, daysUntil){
    zip = substitute(zip)
    dt[, substitute(newColName) := 
           dt[.(zip = zip, d0 = (date - daysUntil), d1 = (date - daysFrom))
              , on=.(zip, date >= d0, date <= d1), 
              sum(purchaseAmount)
              , by=.EACHI ]$V1
       ]
}

Then give it a try

sumPreviousPurchases(DT, "prevPurch1to10", 0, 10)

which returns

DT
          date  zip purchaseAmount prevPurch1to10
 1: 2016-02-07 1150              5              5
 2: 2016-02-14 3000             15             15
 3: 2016-03-16 1150             16             16
 4: 2016-03-21 2000             18             18
 5: 2016-04-06 2000             19             19
 6: 2016-04-14 2000             11             30
 7: 2016-04-16 2000              6             36
 8: 2016-05-02 1150             17             17
 9: 2016-05-08 3000              7              7
10: 2016-05-09 3000             20             27

Notes:

  • The parentheses in your solution () force the evaluation of the argument. This is implemented in base R and is a common technique across many programming languages, based on the mathematical concept of order of operations. (first evaluate objects in parentheses, then exponetiate, etc.). The use of substitute makes the substitution explicit, perhaps for easier reading.

  • Often, an argument to a function that will define a future object, like prevPurch1to10, requires quotes, since the object does not exist prior to calling the function. Using such an argument without quotes will usually result in an error: "object X not found."

lmo
  • 37,904
  • 9
  • 56
  • 69
  • Hi, thanks for this. I also added my own answer, because the `substitute` does not seem to be necessary, and you also added the quotes around the `"prevPurch1to10"`, the necessity of which I do not understand. – Mantelimies Feb 02 '17 at 13:20