I want to write a function that takes the name of a data.table and the name of one of its columns as character strings.
This function does what I want and it works:
library(data.table)
n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))
dataName = "d"
colName = "x"
# Objective :
FOO <- function(dataName = "d", colName = "x") {
  get(dataName)[, mean(get(colName)), by = grp]
}
The problem is that evaluating get() for each group is very time-consuming. On a real data set it takes 14 times longer than the static-name equivalent. I would like to reach the same execution time as if the column names were static.
What I tried:
(cl <- substitute(mean(eval(parse(text = colName))), list(colName = as.name(colName))))
microbenchmark::microbenchmark(
# 1) works and quick but does not use variable names of columns (654ms)
(t1 <- d[, mean(x), by = grp]),
# 2) works but slow (1006ms)
(t2 <- get(dataName)[, mean(get(colName)), by = grp]),
# 3) works but slow (4075ms)
(t3 <- eval(parse(text = dataName))[, mean(eval(parse(text = colName))), by = grp]),
# 4) works but very slow (37202ms)
(t4 <- get(dataName)[, eval(cl), by = grp]),
# 5) double-dot syntax: does not work, probably because I am not using it correctly
# (t5 <- get(dataName)[, mean(..colName), by = grp]),
times = 10)
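For comparison, here is a minimal sketch of one more variant that could be added to the benchmark: it builds the unevaluated call mean(x) once from the column name and passes it to j through eval(). The function name FOO2, the object name jcall, and the smaller n are assumptions made for illustration only; no timing is claimed for it.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumption: smaller than the 1e7 above so the sketch runs quickly
d <- data.table(x = 1:n, grp = sample(1:1e3, n, replace = TRUE))

# FOO2 is a hypothetical name used only for this sketch
FOO2 <- function(dataName = "d", colName = "x") {
  dt <- get(dataName)
  # Build the unevaluated call mean(x) once, with the column as a symbol
  jcall <- substitute(mean(col), list(col = as.name(colName)))
  # Pass the prebuilt call to j; the result column gets the default name V1
  dt[, eval(jcall), by = grp]
}

res <- FOO2("d", "x")
```

The result should match d[, mean(x), by = grp], since the call held in jcall is exactly mean(x) with the column referred to by symbol rather than looked up with get().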
Is the double-dot syntax appropriate here? Why is 4) so slow? I took that approach from this post, where it was the best option, and I adapted the double-dot syntax from this post.
Thanks a lot for your help!