I want to write a function that takes the name of a data.table and the name of one of its columns as character strings.

The following function does what I want, and it works:

library(data.table)

n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))
dataName <- "d"
colName <- "x"

# Objective:
FOO <- function(dataName = "d",
                colName = "x") {
  # Look up the table and the column by name
  get(dataName)[, mean(get(colName)), by = grp]
}

The problem is that evaluating get() for each group is very time-consuming. On a real data example it takes 14 times longer than the equivalent query with static names. I would like to reach the same execution time as if the column names were static.

What I tried:

(cl <- substitute(mean(eval(parse(text = colName))), list(colName = as.name(colName))))

microbenchmark::microbenchmark(

  # 1) works and quick but does not use variable names of columns (654ms)
  (t1 <- d[, mean(x), by = grp]),

  # 2) works but slow (1006ms)
  (t2 <- get(dataName)[, mean(get(colName)), by = grp]),

  # 3) works but slow (4075ms)
  (t3 <- eval(parse(text = dataName))[, mean(eval(parse(text = colName))), by = grp]),

  # 4) works but very slow (37202ms)
  (t4 <- get(dataName)[, eval(cl), by = grp]),

  # 5) double-dot syntax: does not work, probably because I am misusing it
  # (t5 <- get(dataName)[, mean(..colName), by = grp]),

  times = 10)

Is the double-dot syntax appropriate here? Why is 4) so slow? I took approach 4) from this post, where it was the best option, and I adapted the double-dot syntax from this post.
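A likely reason for the slowness of 4): `substitute()` replaces `colName` with the symbol `x`, so `cl` ends up as `mean(eval(parse(text = x)))`, which parses the column values themselves as R code in every group. The sketch below assumes the intended call was simply `mean(x)` and builds it directly; the placeholder `col` and the small test table are illustrative. As far as I know, data.table resolves a top-level `eval()` in `j` before grouping, so this should behave like the static query.

```r
library(data.table)

set.seed(1)
d <- data.table(x = 1:1000, grp = sample(1:10, 1000, replace = TRUE))
colName <- "x"

# Build the call mean(x) once, outside the grouped query.
# `col` is only a placeholder that substitute() swaps for the symbol x.
cl <- substitute(mean(col), list(col = as.name(colName)))

t_static <- d[, mean(x), by = grp]   # static column name
t_call   <- d[, eval(cl), by = grp]  # column name supplied as a string

# Both queries should give the same result.
all.equal(t_static, t_call)
```

As for the double-dot prefix: `..colName` looks `colName` up in the calling scope, so it stands for the string "x" rather than the column. It is meant for selecting columns, as in `d[, ..colName]`, not for use inside `mean()`.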

Thanks a lot for your help !

Samuel Allain
  • Is there a particular reason your `FOO` function needs to get the data from a variable *name* using `get`? That method will be horribly difficult to troubleshoot because of namespace/environment inheritance (especially in nested function calls), and seems unnecessarily inefficient (instead of `FOO("mydata")`, just call `FOO(mydata)`?). – r2evans Oct 26 '21 at 12:31
  • Thanks! I thought that passing the data to the function would copy it and would be less efficient, but that seems to be a misconception. I should learn more about the mechanics of R functions. Thanks for pointing out those drawbacks :) – Samuel Allain Oct 26 '21 at 12:48
  • Use the `env` argument to parameterize your data.table query; note that this feature may not be on CRAN yet. – jangorecki Nov 03 '21 at 21:20
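A minimal sketch of the `env` approach mentioned in the last comment, assuming data.table 1.15.0 or later (the first CRAN release I am aware of that includes the `env` argument). The function name `FOO_env` and the placeholder `col` are hypothetical. Character values in the `env` list are converted to symbols before the query runs, so the query is evaluated as if the column name had been written statically.

```r
library(data.table)  # assumes data.table >= 1.15.0 for the env argument

set.seed(1)
d <- data.table(x = 1:1000, grp = sample(1:10, 1000, replace = TRUE))

# Hypothetical helper: `col` is a placeholder that env replaces
# with the column whose name is stored in colName.
FOO_env <- function(dt, colName = "x") {
  dt[, mean(col), by = grp, env = list(col = colName)]
}

res_env    <- FOO_env(d, "x")
res_static <- d[, mean(x), by = grp]
all.equal(res_env, res_static)
```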

1 Answer

It would be better to pass the dataset d itself to the FOO function instead of the character string "d". Also, you can use lapply combined with .SD and .SDcols instead of mean(get(colName)), so that the query benefits from data.table's internal optimization.

FOO2 = function(dataName=d, colName = "x") { # d instead of "d" passed to the first argument!
  dataName[, lapply(.SD, mean), by=grp, .SDcols=colName]
}
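On the worry that passing the table itself would copy it: R does not copy an argument unless the function modifies it. A small check, using data.table's `address()` helper and a hypothetical function `where_is`, compares the memory address seen inside and outside a function:

```r
library(data.table)

d <- data.table(x = 1:10, grp = rep(1:2, 5))

# Hypothetical helper: reports the memory address of its argument.
where_is <- function(dt) address(dt)

# The function sees the very same object, so no copy was made.
same_object <- identical(address(d), where_is(d))
same_object
```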

Benchmark: FOO vs FOO2

library(data.table)

set.seed(147852)
n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))

microbenchmark::microbenchmark(
  FOO(),
  FOO2(),
  times=5L
)

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval
  FOO() 4632.4014 4672.7781 4787.4958 4707.9023 4846.7081 5077.6893     5
 FOO2()  255.0828  267.1322  297.0389  275.4467  281.9873  405.5456     5
  • Thank you very much Christian! It works very well! In my real-life example, I am doing several operations at once with several columns such as `colName`: `.N`, `sum(colName1)`, `sum(colName2 * colName3)`... so I put all the needed columns in `.SDcols` and referred to each by `.SD[,1]`, `.SD[,2]` – Samuel Allain Oct 26 '21 at 12:31
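The multi-column case described in this comment could look roughly like the sketch below. The function name `FOO3`, its arguments, and the example columns `a`, `b`, `c` are hypothetical. It uses `.SD[[i]]`, which extracts the i-th selected column as a plain vector, rather than `.SD[, i]`, which returns a one-column data.table.

```r
library(data.table)

set.seed(1)
d <- data.table(a = 1:100, b = runif(100), c = runif(100),
                grp = sample(1:5, 100, replace = TRUE))

# Hypothetical helper: column names arrive as strings and are
# addressed by position within .SD, in the order given to .SDcols.
FOO3 <- function(dt, col1, col2, col3) {
  dt[, .(n   = .N,
         s1  = sum(.SD[[1]]),
         s23 = sum(.SD[[2]] * .SD[[3]])),
     by = grp, .SDcols = c(col1, col2, col3)]
}

res      <- FOO3(d, "a", "b", "c")
expected <- d[, .(n = .N, s1 = sum(a), s23 = sum(b * c)), by = grp]
all.equal(res, expected)
```

Positional references to `.SD` depend on the order of `.SDcols`, so this is easy to get wrong when the column list changes; it is a sketch under those assumptions rather than a recommended pattern.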