16

There are multiple ways to select columns of data.table by using a variable holding the desired column names (with=FALSE, .., mget, ...).

Is there a consensus which to use (when)? Is one more data.table-y than the others?

I could come up with the following arguments:

  1. with=FALSE and .. are almost equally fast, while mget is slower
  2. .. can't select concatenated column names "on the fly" (EDIT: current CRAN version 1.12.8 definitely can, I was using an old version, which could not, so this argument is flawed)
  3. mget() is close to the useful syntax of get(), which seems to be the only way to use a variable name in a calculation in j

To (1):

library(data.table)
library(microbenchmark)

a <- mtcars
setDT(a)

selected_cols <- names(a)[1:4]

microbenchmark(a[, mget(selected_cols)],
               a[, selected_cols, with = FALSE],
               a[, ..selected_cols],
               a[, .SD, .SDcols = selected_cols])

#Unit: microseconds
#                             expr     min       lq     mean   median       uq      max neval cld
#          a[, mget(selected_cols)] 468.483 495.6455 564.2953 504.0035 515.4980 4341.768   100   c
#  a[, selected_cols, with = FALSE] 106.254 118.9385 141.0916 124.6670 130.1820  966.151   100 a  
#              a[, ..selected_cols] 112.532 123.1285 221.6683 129.9050 136.6115 2137.900   100 a  
# a[, .SD, .SDcols = selected_cols] 277.536 287.6915 402.2265 293.1465 301.3990 5231.872   100  b 

To (2):

b <- data.table(x = rnorm(1e6), 
                y = rnorm(1e6, mean = 2, sd = 4), 
                z = sample(LETTERS, 1e6, replace = TRUE))

selected_col <- "y"

microbenchmark(b[, mget(c("x", selected_col))],
               b[, c("x", selected_col), with = FALSE],
               b[, c("x", ..selected_col)])
# Unit: milliseconds
#                                    expr      min       lq      mean   median       uq      max neval cld
#         b[, mget(c("x", selected_col))] 5.454126 7.160000 21.752385 7.771202 9.301334 147.2055   100   b
# b[, c("x", selected_col), with = FALSE] 2.520474 2.652773  7.764255 2.944302 4.430173 100.3247   100  a 
#             b[, c("x", ..selected_col)] 2.544475 2.724270 14.973681 4.038983 4.634615 218.6010   100  ab

To (3):

b[, sqrt(get(selected_col))][1:5]
# [1] NaN 1.3553462 0.7544402 1.5791845 1.1007728

b[, sqrt(..selected_col)]
# error

b[, sqrt(selected_col), with = FALSE]
# error

EDIT: added .SDcols to the benchmark in (1), b[, c("x", ..selected_col)] to (2).

bendae
  • 779
  • 3
  • 13
  • 2
    For completness you might want to include `a[, .SD, .SDcols = selected_cols]` – s_baldur Feb 17 '20 at 13:18
  • 2
    In (2) concatenation of columns on the fly can be done using: `b[, c("x", ..selected_col)]` – G. Grothendieck Feb 17 '20 at 13:25
  • 2
    Also in (3) double square brackets can be used: `b[, sqrt(.SD[[selected_col]])]` or `sqrt(b[[selected_col]])` – G. Grothendieck Feb 17 '20 at 13:34
  • 2
    You can always compute on the language: `eval(substitute(b[, .(x, selected_col)], list(selected_col = as.name(selected_col))))` – Roland Feb 17 '20 at 13:42
  • Indeed, `b[, c("x", ..selected_col)]` works. I had tried it only with a (very) old version of data.table – bendae Feb 17 '20 at 13:44
  • @G.Grothendieck Concerning double square brakcets for (3): that is true, although that syntax doesn't work as beautifully when trying to assign the result to a new_col, e.g. `b[, new_col := sqrt(get(selected_col)) + sqrt(x)]` – bendae Feb 17 '20 at 13:54
  • To create a new column use `newcol <- "z"; b[, c(newcol) := 1:10]` – G. Grothendieck Feb 17 '20 at 14:34
  • That's not what I meant. My point was that using `b[, new_col := sqrt(b[[selected_col]])]` doesn't read as nicely as the `get()` version, because `b` is reference twice, and I find that using square brackets inside the ones of `data.table` can be confusing to some. – bendae Feb 17 '20 at 14:46
  • You can replace the inner b with .SD to avoid referencing it twice. – G. Grothendieck Feb 17 '20 at 15:55
  • when i increase the num of rows to 1e8, the timings are still around 900ms to 2s. Are you going to copy this large number of rows many times for the timings to be significant? – chinsoon12 Feb 18 '20 at 00:51
  • 1
    I don't plan on doing that, specifically. Just thought it might be a good idea to add timings to the post, because I'm struggling to find arguments that support one way or another. – bendae Feb 18 '20 at 08:13

1 Answers1

3

Should I use mget(), .. or with=FALSE to select columns of a data.table?

You should use whatever is your preference, as long as it is not deprecated of course. I don't see any realistic use case when performance differences across presented solutions would be making real difference. The are some arguments for using with=FALSE over other interfaces but those are more related to maintenance of those interfaces, and not really user usage.


In recent data.table version, starting from 1.14.1, there is a new feature for working with data.table in a way that enables deep parameterizing data.table queries. This new interface, let's call it "env arg" can be used to solve the problem in your question. Yes, another way to solve your problem. This env arg interface is much more generic, so in such a simply use case I would still use with=FALSE. Below I added verbose=TRUE to this new interface usage so readers can see how queries were pre-processed for substitutions of variables.

b = data.table(x = 1L, y = 2, z = "c")
selected_col = "y"

b[, c("x", selected_col), with=FALSE]
#       x     y
#   <int> <num>
#1:     1     2

b[, .cols, env=list(.cols=I(c("x",selected_col))), verbose=T]
#Argument 'j'  after substitute: c("x", "y")
#       x     y
#   <int> <num>
#1:     1     2

b[, .cols, env=list(.cols=as.list(c("x",selected_col))), verbose=T]
#Argument 'j'  after substitute: list(x, y)
#       x     y
#   <int> <num>
#1:     1     2

New env interface will also nicely support (3)

b[, sqrt(.col), env=list(.col=selected_col), verbose=T]
#Argument 'j'  after substitute: sqrt(y)
#[1] 1.414214
jangorecki
  • 16,384
  • 4
  • 79
  • 160
  • @bendae please consider accepting my answer. I think it well address your question. If it doesn't please provide feedback.. – jangorecki Jun 03 '21 at 15:35
  • Thank you very much, I didn't yet know about env or that using verbose within [ ] was a possibility. – bendae Jun 07 '21 at 18:16