
I was attempting to answer this nice question about creating a non-standard evaluating function for a data.table object, doing a grouped sum. Akrun came up with a lovely answer which I'll simplify here:

akrun <- function(data, var, group){
 var <- substitute(var)
 group <- substitute(group)
 data[, sum(eval(var)), by = group]
}

library(data.table)
mt = as.data.table(mtcars)
akrun(mt, mpg, cyl)
#    group    V1
# 1:     6 138.2
# 2:     4 293.3
# 3:     8 211.4

I was also working on an answer and had arrived at nearly the same code, but with the substitute calls written inline inside the data.table call. Mine results in an error:

gregor = function(data, var, group) {
  data[, sum(eval(substitute(var))), by = substitute(group)]
} 

gregor(mt, mpg, cyl)
# Error in `[.data.table`(data, , sum(eval(substitute(var))), by = substitute(group)) : 
#  'by' or 'keyby' must evaluate to vector or list of vectors 
#  (where 'list' includes data.table and data.frame which are lists, too) 

On its face, my function just inlines the substitute calls from Akrun's. Why doesn't it work?


Note that both substitutions cause problems, as shown here:

gregor_1 = function(data, var, group) {
  var = substitute(var)
  data[,sum(eval(var)), 
       by = substitute(group)]
} 
gregor_1(mt, mpg, cyl)
# Same error as above


gregor_2 = function(data, var, group) {
  group = substitute(group)
  data[,sum(eval(substitute(var))), 
       by = group]
} 
gregor_2(mt, mpg, cyl)
# Error in eval(substitute(var)) : object 'mpg' not found 
Gregor Thomas
  • 1
    I don't have time (or probably the ability) to dig more into this right now, but it seems like within the `data.table` call, whatever variables you pass to `substitute` remain symbols (e.g., if you wrap it in `rleid` or `as.integer` the error is a little clearer to noobs like me). Is that correct? So, `substitute` does not look within the function's environment when used inside a data.table call? Also, I am very green on all things data.table, so I apologize if this is already evident. – Andrew Oct 31 '19 at 19:24
  • 1
    I think that's a nice contribution. Akrun's original suggestion was about environments too, I tried versions like `substitute(var, env = parent.frame(environment()))`, also with no luck. But it seems likely that is the right track. – Gregor Thomas Oct 31 '19 at 19:49
  • Hey Gregor! (Michael from UW STATR). I've been thinking a lot recently about writing functions for data.table procedures and have gone down the road of creating functions that take character vectors of column names as arguments rather than unquoted symbols. In general, writing functions with data.table has some potential issues when the function modifies the underlying data.table object, for example by creating temporary columns to avoid copying the entire table: there is potential for column name conflicts and for changing the key of the input table. I'd like to see more discussion around best practices. – Michael Feb 21 '20 at 19:18
  • see my answer https://stackoverflow.com/questions/58648886/a-simple-reproducible-example-to-pass-arguments-to-data-table-in-a-self-defined/60344884#60344884 – Michael Feb 21 '20 at 19:21

3 Answers


substitute's documentation describes how it decides what to substitute and notes that, by default, it searches the environment where it is called. If you call substitute inside the data.table frame (i.e. inside []), it cannot find the symbols, because they are not bound in the data.table evaluation environment; they live in the environment where [ was called.
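The lookup rule can be seen without data.table at all. The following is a minimal base R sketch, not data.table's actual internals: `capture_demo` is a hypothetical helper, and evaluating against a plain list only stands in for the data.table evaluation environment.

```r
# Hypothetical helper: compare substitute() called directly in the function
# frame with the same call evaluated in an environment built from a list.
capture_demo <- function(arg) {
  # `arg` is a promise in this frame, so its unevaluated expression is returned.
  direct <- substitute(arg)
  # eval() with a list creates a new environment; `arg` is not bound in it,
  # so substitute() has nothing to replace and returns the symbol unchanged.
  in_data <- eval(quote(substitute(arg)), list(x = 1:3), environment())
  list(direct = direct, in_data = in_data)
}

res <- capture_demo(mpg + 1)
print(res$direct)   # the expression supplied by the caller
print(res$in_data)  # only the symbol `arg`
```

The second result is a bare symbol rather than a vector, which is consistent with the 'by' error reported in the question.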

You can "invert" the order in which the functions are called in order to get the behavior you want:

library(data.table)

foo <- function(dt, group, var) {
    eval(substitute(dt[, sum(var), by = group]))
}

foo(as.data.table(mtcars), cyl, mpg)
#    cyl    V1
# 1:   6 138.2
# 2:   4 293.3
# 3:   8 211.4
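
To see why this ordering works, it can help to inspect the call that substitute builds before eval runs it. This is a small base R sketch; `show_call` is a hypothetical name used only for illustration, and `mt` does not need to exist because nothing is evaluated.

```r
# Hypothetical helper: return the substituted call instead of evaluating it.
show_call <- function(dt, group, var) {
  substitute(dt[, sum(var), by = group])
}

built <- show_call(mt, cyl, mpg)
print(built)
```

Every argument is replaced by the expression the caller supplied, so the resulting call is mt[, sum(mpg), by = cyl], and data.table only ever sees plain column names.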
Alexis

It seems that substitute does not work within a data.table call the way one might expect from other contexts. However, you can use enexpr from the rlang package in place of substitute:

library(data.table)
library(rlang)

gregor_rlang = function(data, var, group) {
  data[, sum(eval(enexpr(var))), by = .(group = eval(enexpr(group)))]
} 

gregor_rlang(mt, mpg, cyl)
##    group    V1
## 1:     6 138.2
## 2:     4 293.3
## 3:     8 211.4

environments

The problem seems to be related to environments, since the following works once we explicitly give substitute the environment it should use.

gregor_pf = function(data, val, group) {
  data[, sum(eval(substitute(val, parent.env(environment())))), 
    by = c(deparse(substitute(group)))]
} 
gregor_pf(mt, mpg, cyl)
##      cyl    V1
## 1:     6 138.2
## 2:     4 293.3
## 3:     8 211.4
G. Grothendieck

data.table uses NSE because it needs to analyse and manipulate the by argument before deciding whether to evaluate it (if you give it a symbol, for example, it won't evaluate it).

A consequence is that if the argument needs to be evaluated it should be evaluated in the right environment and this is the function's responsibility. data.table evaluates its by argument in the data, not in the calling environment.

In most cases you don't see the issue as the symbol will be evaluated in the parent environment if not found, but substitute() is more sensitive.

See the example below:

fun <- function(x){
  standard_eval(x)
  non_standard_eval_safe(x)
  non_standard_eval_not_safe(x)
}

standard_eval          <- function(expr) print(expr)

non_standard_eval_safe <- function(expr) {
  expr <- bquote(print(.(substitute(expr)))) # will be quote(print(x)) in our example
  eval.parent(expr)
}

non_standard_eval_not_safe <- function(expr) {
  expr <- bquote(print(.(substitute(expr))))  # will be quote(print(x)) in our example
  eval(expr)
}

standard_eval(1+1)          
#> [1] 2

non_standard_eval_safe(1+1)
#> [1] 2

non_standard_eval_not_safe(1+1)
#> [1] 2

fun(1+1)
#> [1] 2
#> [1] 2
#> Error in print(x): object 'x' not found


Created on 2020-02-20 by the reprex package (v0.3.0)

moodymudskipper