There are multiple ways to select columns of a data.table using a variable that holds the desired column names (with=FALSE, .., mget, ...).
Is there a consensus on which to use, and when? Is one more data.table-y than the others?
I could come up with the following arguments:
1. with=FALSE and .. are almost equally fast, while mget is slower.
2. .. can't select concatenated column names "on the fly" (EDIT: the current CRAN version, 1.12.8, definitely can; I was using an old version that could not, so this argument is flawed).
3. mget() is close to the useful syntax of get(), which seems to be the only way to use a variable name in a calculation in j.
To (1):
library(data.table)
library(microbenchmark)
a <- mtcars
setDT(a)
selected_cols <- names(a)[1:4]
microbenchmark(a[, mget(selected_cols)],
               a[, selected_cols, with = FALSE],
               a[, ..selected_cols],
               a[, .SD, .SDcols = selected_cols])
#Unit: microseconds
# expr min lq mean median uq max neval cld
# a[, mget(selected_cols)] 468.483 495.6455 564.2953 504.0035 515.4980 4341.768 100 c
# a[, selected_cols, with = FALSE] 106.254 118.9385 141.0916 124.6670 130.1820 966.151 100 a
# a[, ..selected_cols] 112.532 123.1285 221.6683 129.9050 136.6115 2137.900 100 a
# a[, .SD, .SDcols = selected_cols] 277.536 287.6915 402.2265 293.1465 301.3990 5231.872 100 b
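As a sanity check on the comparison in (1), here is a minimal sketch confirming that the four idioms return the same data.table, so the benchmark really compares equivalent operations. The names dt, cols and res_* are only illustrative and not part of the benchmark code above.

```r
library(data.table)

# as.data.table() makes a copy of mtcars, so the built-in dataset is untouched
dt <- as.data.table(mtcars)
cols <- names(dt)[1:4]

res_with   <- dt[, cols, with = FALSE]      # character vector + with = FALSE
res_dots   <- dt[, ..cols]                  # .. prefix looks up cols in the calling scope
res_mget   <- dt[, mget(cols)]              # mget() returns a named list of columns
res_sdcols <- dt[, .SD, .SDcols = cols]     # .SD restricted via .SDcols

# All four results should have the same columns and values
stopifnot(
  isTRUE(all.equal(res_with, res_dots)),
  isTRUE(all.equal(res_with, res_mget)),
  isTRUE(all.equal(res_with, res_sdcols))
)
```

If the stopifnot() call passes silently, the timing differences come from how each idiom is evaluated, not from differing results.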
To (2):
b <- data.table(x = rnorm(1e6),
                y = rnorm(1e6, mean = 2, sd = 4),
                z = sample(LETTERS, 1e6, replace = TRUE))
selected_col <- "y"
microbenchmark(b[, mget(c("x", selected_col))],
               b[, c("x", selected_col), with = FALSE],
               b[, c("x", ..selected_col)])
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# b[, mget(c("x", selected_col))] 5.454126 7.160000 21.752385 7.771202 9.301334 147.2055 100 b
# b[, c("x", selected_col), with = FALSE] 2.520474 2.652773 7.764255 2.944302 4.430173 100.3247 100 a
# b[, c("x", ..selected_col)] 2.544475 2.724270 14.973681 4.038983 4.634615 218.6010 100 ab
To (3):
b[, sqrt(get(selected_col))][1:5]
# [1] NaN 1.3553462 0.7544402 1.5791845 1.1007728
b[, sqrt(..selected_col)]
# error
b[, sqrt(selected_col), with = FALSE]
# error
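For what it's worth, get() may not be the only route for (3). The sketch below shows two alternatives that I believe give the same result on a small table: pulling the column out of .SD after restricting it with .SDcols, and extracting it with [[ outside of j. The table dt, the seed and the via_* names are made up for illustration, and I have not benchmarked these against get().

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
dt <- data.table(x = rnorm(10), y = rnorm(10, mean = 2, sd = 4))
selected_col <- "y"

# sqrt() warns about NaNs for negative values; the warnings are suppressed
# here because only the equality of the results matters.

# Reference: get() inside j, as in (3)
via_get <- suppressWarnings(dt[, sqrt(get(selected_col))])

# Alternative 1: limit .SD to the one column, then extract it as a vector
via_sd <- suppressWarnings(dt[, sqrt(.SD[[1L]]), .SDcols = selected_col])

# Alternative 2: skip j and use list-style extraction on the data.table
via_brackets <- suppressWarnings(sqrt(dt[[selected_col]]))

stopifnot(identical(via_get, via_sd), identical(via_get, via_brackets))
```

The .SDcols variant stays inside j, so it should still combine with i and by, whereas dt[[selected_col]] is plain list extraction and does not.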
EDIT: added .SDcols to the benchmark in (1), and b[, c("x", ..selected_col)] to (2).