5

I have been googling this answer for a few hours. A lot of people have asked similar questions, but I did not find either a simple enough question or a straightforward answer. Here is my approach:

Assume that I want to do a simple group by in data.table:

library(data.table)
mtcars = data.table(mtcars)
mtcars[,sum(mpg), gear]

# Here are the results
#   gear    V1
#1:    4 294.4
#2:    3 241.6
#3:    5 106.9

However, if I use a self-defined function to do this:

zz = function(data, var, group){
  return(data[,sum(var), group])
}
zz(mtcars, mpg, gear)

I got an error message:

Error in eval(bysub, parent.frame(), parent.frame()) : object 'gear' not found

I've tried substitute, eval, quote, and other solutions, but none of them works. I wonder if anyone could give a more straightforward solution and explanation to this.

Adrian Mole
  • 49,934
  • 160
  • 51
  • 83
Miao Cai
  • 902
  • 9
  • 25

3 Answers3

5

If we are using unquoted arguments, substitute and evaluate

zz <- function(data, var, group){
 var <- substitute(var)
 group <- substitute(group)
 setnames(data[, sum(eval(var)), by = group],
        c(deparse(group), deparse(var)))[]
 # or use
 #  setnames(data[, sum(eval(var)), by = c(deparse(group))], 2, deparse(var))[]

}
zz(mtcars, mpg, gear)
#   gear   mpg
#1:    4 294.4
#2:    3 241.6
#3:    5 106.9
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
akrun
  • 874,273
  • 37
  • 540
  • 662
  • @Gregor I am guessing that it could be related to the `env` `substitute(expr, env)` as the `substitute` would check for the variables within the env of the data.table. Can you try by changing the `env` – akrun Oct 31 '19 at 18:14
  • That makes sense, but after playing a little bit I'm still not getting it to work. Guess I'll have to ask a question... – Gregor Thomas Oct 31 '19 at 18:27
  • @Gregor I also tried it. I remember similar question before but couldn't find out. Please do ask as a question. Thanks – akrun Oct 31 '19 at 18:28
  • Thanks for the prompt answer! I am having the same question as @Gregor. Could you also post your question in the comment of this question? – Miao Cai Oct 31 '19 at 18:38
  • 1
    New question is here: https://stackoverflow.com/q/58649510/903061 (now I wish I hadn't deleted the original comment) – Gregor Thomas Oct 31 '19 at 18:45
  • 1
    @Gregor Thank you! – Miao Cai Oct 31 '19 at 21:43
5

While not perfect, the ... argument can be helpful:

zz = function(dt, ...){
  return(dt[...])
}

zz(mtcars, , sum(mpg), gear)

   gear    V1
1:    4 294.4
2:    3 241.6
3:    5 106.9
Cole
  • 11,130
  • 1
  • 9
  • 24
1

I don't really see the point in writing a function that takes unquoted arguments. Why not just use the data.table syntax directly?

If you want to write a function it makes more sense to take a character vector of column names since this is way more programmable than symbols (think about programming with setkey versus setkeyv or writing functions to create ggplots with aes versus aes_string ) The downside is that the internal of the function is messy and requires eval(parse(text=.))) NSE in order for GForce to work correctly, but the function interface is more extensible.

zz = function(data, var, group){
  eval(parse(text=paste0("data[,sum(",var,"),by=",group,"]")))
}
zz(mtcars, "mpg", "gear")
Michael
  • 5,808
  • 4
  • 30
  • 39
  • 1
    Hey Michael! Definitely an interesting approach. I don't write many functions like this myself, but I definitely see the appeal of unquoted arguments for interactive use, especially with tab completion. The eval-parse-paste still *feels* bad to me, though I can't identify any particular risk in this case. – Gregor Thomas Feb 23 '20 at 21:06
  • It took me a while to settle on this. I've seen Matt Dowle recommending this approach (i think he's referred to this as a "macro") and that's made me a bit more confident that this is a good way to go. I believe Matt has spent a lot of time SQL programming which relies pretty heavily on macro programming in the form of procedures which maybe explains why this is his recommendation for this approach to function writing around data.table. – Michael Feb 24 '20 at 00:59
  • Unquoted strings are great for interactive scripting but as i said in that case I'd stick with directly using the data.table interface since it will be more readable to me later on or to others who are data.table literate. The data.table syntax is so concise that it doesn't worth it to obscure it even for commonly used operations. The above approach i think is for more for complicated reusable procedures in data pipelines (eg replacing sql procedures). – Michael Feb 24 '20 at 01:03
  • I'll add that this to my knowledge this is the easiest (only?) way to programmatically use data.table for arbitrary column names in a way that doesnt break the optimizations. (this is maybe partially based on my definition of "arbitrary"--i don't consider a symbol an arbitrary column name since it's a language object rather than a vector. My approach obscures and terminates the NSE versus a function taking unquoted string puts the burden on the user to do more nse to program with that function to use it programmatically ).But if there's another way to use quoted strings id love to hear it – Michael Feb 24 '20 at 01:09
  • 1
    Well, the `by` argument can be a quoted string---either a character vector of column names or a length-one char vector of comma separated names, but I think you're right about `j`, using `.SDcols` is (I think) less efficient. Still, `x = c("mpg", "hp")`, `g = c("am", "cyl")`, then `mtcars[, lapply(.SD, mean), by = g, .SDcols = g)` is pretty nice to write. But certainly not as flexible as allowing the user to input an arbitrary string. – Gregor Thomas Feb 24 '20 at 01:20
  • 1
    Meanwhile, akrun just answered [this question](https://stackoverflow.com/a/60366161/903061) as I typed this comment... – Gregor Thomas Feb 24 '20 at 01:29
  • 1
    Yeah the .SD approach is one I've experimented with as well and almost meets my criteria for arbitrary programming (it doesnt work in the i statement, and you have do some other tricks to get arbitrary column names in LHS assignment). Ultimately I settled against the .SD approach for programming is because it requires keeping track of indices of .SD which i found difficult to follow for complicated j statements (even more so than macros which at least can be debugged by evaluating the paste to get very readable data.table syntax) – Michael Feb 24 '20 at 01:43
  • 1
    For example it might not be straightforward to separately and repeatedly, for example, take the product between column 1 of a data.table and an arbitrary number of other columns without using some finicky .SD indexing combined with an lapply. – Michael Feb 24 '20 at 01:51