Preread

I went through some material here on SO and, after getting a perfect answer to my previous problem, I am trying to get my head around, once and for all, how to canonically deal with data.tables in functions.

Underlying Problem

What I eventually want is to create a function which takes some R expressions as inputs and evaluates them in the context of a data.table (both in the i as well as in the j part). The quoted answers tell me that I have to use some get/eval/substitute combination if my inputs become more complicated than just a single column (in which case I could live with the ..string or the with = FALSE approach [1]).

My real data is rather big, so I am concerned about computational time.

Ultimately, if I want full flexibility (that is, passing in expressions rather than bare column names), I understand that I have to go for an eval approach.

Code speaks a thousand words, so let's illustrate what I have found out so far:

Setup

library(data.table)
iris <- copy(iris)
setDT(iris)

Workhorse Function

my_fun <- function(my_i, my_j, option_sel = 1, my_data = iris, by = NULL) {
   switch(option_sel,
      {
         ## option 1 - base R deparse
         my_data[eval(parse(text = deparse(substitute(my_i)))), 
                 eval(parse(text = deparse(substitute(my_j)))),
                 by]
      },
      {
         ## option 2 - base R even shorter
         my_data[eval(substitute(my_i)), 
                 eval(substitute(my_j)),
                 by]

      },
      {
         ## option 3 - rlang
         my_data[rlang::eval_tidy(rlang::enexpr(my_i)),
                 rlang::eval_tidy(rlang::enexpr(my_j), data = .SD),
                 by]

      },
      {
         ## option 4 - if passing only simple column name strings
         ## we can use `with` (in j only)
         my_data[,
                 my_j, with = FALSE,
                 by]

      },
      {
         ## option 5 - if passing only simple column name strings 
         ## we can use ..syntax (in 'j' only)
         my_data[,
                 ..my_j]
                 # , by] ## would give a strange error

      },
      {
         ## option 6 - if passing only simple column name strings
         ## we can use `get`
         my_data[,
                 setNames(.(get(my_j)), my_j),
                 by]

      }
   )
}

Results

## added the unnecessary NULL to enforce same format
## did not want to make complicated ifs for by in the func 
## but by is needed for meaningful benchmarks later
expected <- iris[Species == "setosa", sum(Sepal.Length), NULL]
sapply(1:3, function(i) 
               isTRUE(all.equal(expected,
                                my_fun(Species == "setosa", sum(Sepal.Length), i))))
# [1] TRUE TRUE TRUE

expected <- iris[, .(Sepal.Length), NULL]
sapply(4:6, function(i)
               isTRUE(all.equal(expected,
                                my_fun(my_j = "Sepal.Length", option_sel = i))))
# [1] TRUE TRUE TRUE

Questions

All of the options work but while creating this (admittedly not so) minimal example I had a couple of questions:

  1. To profit the most from data.table, I have to use code that the internal optimizer can profile and, well, optimize [2]. So which of options 1-3 (4-6 are only here for completeness and lack full flexibility) works "best" with data.table, that is, which of these can be internally optimized to take full benefit from data.table? My quick benchmarks suggest that the rlang option is the fastest.
  2. I realized that for option 3 I have to provide .SD as the data argument in the j part, but not in the i part. This is due to scoping, that much is clear. But why does eval_tidy "see" the column names in i but not in j?
  3. Is there any other solution that can be optimized even further?
  4. Using by with option 5 results in a strange error. Why?
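
For question 1, one way to check what the optimizer does with a given call is data.table's `verbose = TRUE` argument, which prints diagnostic messages about the query. A minimal sketch, assuming the stock `iris` data; the names `dt`, `j_expr`, `hand_typed`, and `via_eval` are made up for illustration, and no particular verbose output is claimed:

```r
library(data.table)

dt <- as.data.table(iris)

# Hand-typed call: the baseline that data.table sees directly
hand_typed <- dt[, sum(Sepal.Length), by = Species, verbose = TRUE]

# Same j expression, but supplied as a quoted expression through eval()
j_expr <- quote(sum(Sepal.Length))
via_eval <- dt[, eval(j_expr), by = Species, verbose = TRUE]

# Check that both routes return the same values;
# the verbose messages printed above can then be compared by eye
isTRUE(all.equal(hand_typed, via_eval))
```

Comparing the messages for each of options 1-3 against the hand-typed call would show whether a given wrapper changes what data.table reports about the query.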

Benchmarks

library(dplyr)
size     <- c(1e6, 1e7, 1e8)
grp_prop <- c(1e-6, 1e-4)

make_bench_dat <- function(size, grp_prop) {
   data.table(x = seq_len(size),
              g = sample(ceiling(size * grp_prop), size, grp_prop < 1))
}

res <- bench::press(
   size = size,
   grp_prop = grp_prop,
   {
      bench_dat <- make_bench_dat(size, grp_prop)
      bench::mark(
         deparse    = my_fun(TRUE, max(x), 1, bench_dat, by = "g"),
         substitute = my_fun(TRUE, max(x), 2, bench_dat, by = "g"),
         rlang      = my_fun(TRUE, max(x), 3, bench_dat, by = "g"), 
         relative = TRUE)
   }
)

summary(res) %>% select(expression, size, grp_prop, min, median)
# # A tibble: 18 x 5
#    expression      size grp_prop      min   median
#    <bch:expr>     <dbl>    <dbl> <bch:tm> <bch:tm>
#  1 deparse      1000000 0.000001  22.73ms  24.36ms
#  2 substitute   1000000 0.000001  22.56ms   25.3ms
#  3 rlang        1000000 0.000001   8.09ms   9.05ms
#  4 deparse     10000000 0.000001 274.24ms 308.72ms
#  5 substitute  10000000 0.000001 276.73ms 276.99ms
#  6 rlang       10000000 0.000001 114.52ms 119.21ms
#  7 deparse    100000000 0.000001    3.79s    3.79s
#  8 substitute 100000000 0.000001    3.92s    3.92s
#  9 rlang      100000000 0.000001    3.12s    3.12s
# 10 deparse      1000000 0.0001    29.57ms  36.25ms
# 11 substitute   1000000 0.0001    37.22ms  41.56ms
# 12 rlang        1000000 0.0001     19.3ms  24.07ms
# 13 deparse     10000000 0.0001   386.13ms 396.84ms
# 14 substitute  10000000 0.0001   330.22ms 332.42ms
# 15 rlang       10000000 0.0001   270.54ms 274.35ms
# 16 deparse    100000000 0.0001      4.51s    4.51s
# 17 substitute 100000000 0.0001       4.1s     4.1s
# 18 rlang      100000000 0.0001      2.87s    2.87s

[1] with = FALSE or ..columnName does, however, work only in the j part.

[2] I learned that the hard way when I got a significant performance boost when I replaced purrr::map by base::lapply.

thothal
  • option 4 is invalid, this is not how `with=F` is meant to be used. – jangorecki May 27 '20 at 14:04
  • Hmm, despite the fact that it works, I would be curious to learn what the intended use is then. I thought `x <- "Species"; iris[, x, with = FALSE]` is the intended use? (Of course, this is now superseded by `iris[, ..x]`.) – thothal May 27 '20 at 14:35
  • Yes, exactly, or providing an integer. It is meant to provide a data.frame-like interface, by removing the `with()`-like interface that data.table provides. Which was the initial motivation to write and release the package back in 2006. – jangorecki May 27 '20 at 15:05
  • So your first comment should be read as "option 4 is outdated, `with=F` should nowadays be replaced by the `..x` syntax"? I don't want to nitpick, but I was confused by the term "invalid", as apparently it is valid but outdated code. Or am I missing something again? – thothal May 27 '20 at 15:10
  • The option is not outdated, but it should be used with an integer or character, not with an expression; for an expression this option will be ignored. BTW, check https://www.youtube.com/watch?v=qLrdYhizEMg – jangorecki May 27 '20 at 16:48
  • OK, but if you look into my example, that's exactly what I have done: used it with a character. Hence my confusion. But good that we are now on the same page, thanks for your explanation! – thothal May 27 '20 at 20:33
  • Sorry then, if you pass characters and not expressions then the whole substitution is redundant – jangorecki May 27 '20 at 22:19
  • Understood, and that's what I wrote: `"(4-6 are only here for completeness and lack full flexibility)"` and in the code `"## option 4 - if passing only simple column name strings we can use "with" (in j only)`. – thothal May 28 '20 at 06:13

1 Answer

No need for fancy tools, just use base R metaprogramming features.

my_fun2 = function(my_i, my_j, by, my_data) {
  dtq = substitute(
    my_data[.i, .j, .by],
    list(.i=substitute(my_i), .j=substitute(my_j), .by=substitute(by))
  )
  print(dtq)
  eval(dtq)
}

my_fun2(Species == "setosa", sum(Sepal.Length), my_data=as.data.table(iris))
my_fun2(my_j = "Sepal.Length", my_data=as.data.table(iris))

This way you can be sure that data.table will use all possible optimizations, just as when typing the [ call by hand.
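
To see why this holds, it can help to look at the call before it is evaluated. The sketch below uses a hypothetical helper, `build_dt_call` (not part of data.table or of the function above), that returns the constructed call instead of running it; it needs only base R:

```r
# Hypothetical helper: same construction as my_fun2, minus the eval() step
build_dt_call <- function(my_i, my_j, by, my_data) {
  substitute(
    my_data[.i, .j, .by],
    list(.i = substitute(my_i), .j = substitute(my_j), .by = substitute(by))
  )
}

# my_data is never evaluated here, so it can be left out
built <- build_dt_call(Species == "setosa", sum(Sepal.Length), "Species")

# The same call as one would type it by hand inside the function
typed <- quote(my_data[Species == "setosa", sum(Sepal.Length), "Species"])

# Compare the two calls as text
identical(deparse(built), deparse(typed))
```

Because the expressions are spliced in before `[` is ever called, data.table receives an ordinary call with no wrapper around `i` or `j`.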


Note that in data.table we are planning to make substitution easier, see solution proposed in PR Rdatatable/data.table#4304.

Then, using the extra env argument, the substitution will be handled internally for you:

my_fun3 = function(my_i, my_j, by, my_data) {
  my_data[.i, .j, .by,
          env = list(.i = substitute(my_i), .j = substitute(my_j), .by = substitute(by)),
          verbose = TRUE]
}
my_fun3(Species == "setosa", sum(Sepal.Length), my_data=as.data.table(iris))
#Argument 'j'  after substitute: sum(Sepal.Length)
#Argument 'i'  after substitute: Species == "setosa"
#...
my_fun3(my_j = "Sepal.Length", my_data=as.data.table(iris))
#Argument 'j'  after substitute: Sepal.Length
#...
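
Where the inputs are plain strings rather than expressions, the same `env` interface can take them directly. A sketch, assuming a data.table release that includes the `env` argument from the PR above (1.15.0 or later, to my knowledge); the placeholder names `fun`, `col`, and `grp` are made up for illustration:

```r
library(data.table)

dt <- as.data.table(iris)

# Character values in `env` are substituted as symbols into the call,
# so this is meant to read as dt[, .(out = sum(Sepal.Length)), by = Species]
res <- dt[, .(out = fun(col)), by = grp,
          env = list(fun = "sum", col = "Sepal.Length", grp = "Species")]

# Reference result typed by hand
ref <- dt[, .(out = sum(Sepal.Length)), by = Species]
isTRUE(all.equal(res, ref))
```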
jangorecki
  • A classic. Didn't see the wood for the trees. I'll quickly run some benchmarks on that :) But it looks so straightforward that I am ashamed that I did not come up with that solution myself :) – thothal May 27 '20 at 13:59