data.table grouped operations with variable names of columns without slow DT[, mean(get(colName)), by = grp]

Question

I want to write a function that takes both the column name and the name of the data.table as character variables.

The following function does what I want, and it works:

library(data.table)

n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))
dataName <- "d"
colName <- "x"

# Objective: look up both the table and the column by name
FOO <- function(dataName = "d",
                colName = "x") {
  get(dataName)[, mean(get(colName)), by = grp]
}

The problem is that evaluation of get() for each group is very time-consuming. On a real data example it is 14 times longer than the static-name equivalent. I would like to reach the same execution time as if the column names were static.

What I tried:

(cl <- substitute(mean(eval(parse(text = colName))), list(colName = as.name(colName))))

microbenchmark::microbenchmark(

  # 1) works and quick but does not use variable names of columns (654ms)
  (t1 <- d[, mean(x), by = grp]),

  # 2) works but slow (1006ms)
  (t2 <- get(dataName)[, mean(get(colName)), by = grp]),

  # 3) works but slow (4075ms)
  (t3 <- eval(parse(text = dataName))[, mean(eval(parse(text = colName))), by = grp]),

  # 4) works but very slow (37202ms)
  (t4 <- get(dataName)[, eval(cl), by = grp]),

  # 5) double dot syntax: I could not get it to work
  # (t5 <- get(dataName)[, mean(..colName), by = grp]),

  times = 10)

Is the double dot syntax appropriate here? Why is 4) so slow? I took it from this post, where it was the best option. I adapted the double dot syntax from this post.

Thanks a lot for your help!

Is there a particular reason your `FOO` function needs to get the data from a variable *name* using `get`? That method will be horribly difficult to troubleshoot based on namespace/environment inheritance (esp if in nested function calls), and seems unnecessarily inefficient (instead of `FOO("mydata")`, just call `FOO(mydata)`?). — r2evans, Oct 26 '21 at 12:31
Thanks! I thought that passing the data to the function would copy it and be less efficient, but that seems to be a misconception. I should learn more about the mechanics of R functions. Thanks for pointing out those drawbacks :) — Samuel Allain, Oct 26 '21 at 12:48
Use the `env` argument to parameterize your data.table query; note that this feature may not be on CRAN yet — jangorecki, Nov 03 '21 at 21:20
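
The `env` argument mentioned in the last comment could be used roughly as follows. This is only a sketch: it assumes a data.table version whose `[` method accepts `env` (to my knowledge 1.15.0 or later), and `FOO_env`, `value_col` and `d_small` are hypothetical names that do not appear in the question.

```r
library(data.table)

# Hypothetical helper: `value_col` is a placeholder symbol. `env` replaces it
# with the column named in `colName` before the query is evaluated
# (character values in `env` are turned into column symbols).
FOO_env <- function(data, colName = "x") {
  data[, .(V1 = mean(value_col)), by = grp, env = list(value_col = colName)]
}

# Small illustrative table (assumption, not the benchmark data)
d_small <- data.table(x = 1:6, grp = c(1L, 1L, 2L, 2L, 3L, 3L))
res_env <- FOO_env(d_small, "x")
```

Because the substitution happens once, before grouping, the query that actually runs is the same as the static `mean(x)` version.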

B. Christian Kamgang · Accepted Answer · 2021-10-26T11:11:24.057

It would be better to pass the dataset d itself to the FOO function instead of the character string "d". Also, you can use lapply combined with .SD and .SDcols instead of mean(get(colName)), so that the query benefits from data.table's internal optimization.

FOO2 = function(dataName=d, colName = "x") { # d instead of "d" passed to the first argument!
  dataName[, lapply(.SD, mean), by=grp, .SDcols=colName]
}
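
One way to check that the `.SD` form gives the same result as the static query, and to see what data.table does with `j`, is sketched below. `mean_by_grp` and `d_small` are hypothetical names used only for this illustration. The wording of the `verbose = TRUE` messages varies between data.table versions, so no output is shown.

```r
library(data.table)

# Same idea as FOO2, with the table passed as an object (hypothetical name)
mean_by_grp <- function(data, colName = "x") {
  data[, lapply(.SD, mean), by = grp, .SDcols = colName]
}

# Small illustrative table (assumption, not the benchmark data)
d_small <- data.table(x = 1:6, grp = c(1L, 1L, 2L, 2L, 3L, 3L))

static  <- d_small[, .(x = mean(x)), by = grp]  # column name hard-coded
dynamic <- mean_by_grp(d_small, "x")            # column name as a string
same_result <- isTRUE(all.equal(static, dynamic))

# verbose = TRUE makes data.table report how it rewrote and optimized j
invisible(d_small[, lapply(.SD, mean), by = grp, .SDcols = "x", verbose = TRUE])
```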

Benchmark: `FOO` vs `FOO2`

set.seed(147852)
n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))

microbenchmark::microbenchmark(
  FOO(),
  FOO2(),
  times=5L
)

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval
  FOO() 4632.4014 4672.7781 4787.4958 4707.9023 4846.7081 5077.6893     5
 FOO2()  255.0828  267.1322  297.0389  275.4467  281.9873  405.5456     5
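
On attempt 4 in the question: the `substitute()` call there puts the symbol `x` inside `parse(text = ...)`, so the expression evaluated for every group is `mean(eval(parse(text = x)))`. That parses the column's values rather than its name, which is probably why it is so slow. A minimal sketch of building the call directly, with hypothetical names `col_sym` and `d_small`:

```r
library(data.table)

colName <- "x"
# Small illustrative table (assumption, not the benchmark data)
d_small <- data.table(x = 1:6, grp = c(1L, 1L, 2L, 2L, 3L, 3L))

# Insert the column symbol straight into the call; the result is mean(x)
cl <- substitute(mean(col_sym), list(col_sym = as.name(colName)))

# j is now equivalent to the static mean(x)
res_sub <- d_small[, eval(cl), by = grp]
```

I have not benchmarked this variant on the data above, so treat any speed expectation as an assumption to verify.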

Thank you very much, Christian! It works very well! In my real-life example, I am doing several operations at once on several columns like `colName`: `.N`, `sum(colName1)`, `sum(colName2 * colName3)`... so I put all the needed columns in `.SDcols` and referred to each one by position: `.SD[,1]`, `.SD[,2]` — Samuel Allain, Oct 26 '21 at 12:31
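
The multi-column setup described in this comment might look like the sketch below. The function name `multi_summary`, the columns `a`, `b`, `w` and the table `d_multi` are hypothetical. The sketch uses `.SD[[i]]`, which extracts a column as a plain vector, where the comment uses `.SD[,i]`. I have not benchmarked this form, and it may not receive the same internal optimization as `lapply(.SD, mean)`.

```r
library(data.table)

# Hypothetical helper: positions in .SD follow the order given in .SDcols
multi_summary <- function(data, col1, col2, col3) {
  data[, .(n     = .N,
           total = sum(.SD[[1]]),               # sum(col1)
           cross = sum(.SD[[2]] * .SD[[3]])),   # sum(col2 * col3)
       by = grp, .SDcols = c(col1, col2, col3)]
}

# Small illustrative table (assumption)
d_multi <- data.table(a = 1:4,
                      b = c(2, 2, 3, 3),
                      w = c(10, 20, 30, 40),
                      grp = c(1L, 1L, 2L, 2L))
res_multi <- multi_summary(d_multi, "a", "b", "w")
```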
