Consider
target <- "vs"
value <- 1
library(data.table)
dt <- as.data.table(head(mtcars))
So I'm trying to pass both column name and a value as variables into the j
expression within data.table
environment, something that would be equivalent to
dt[, vs == 1]
# [1] FALSE FALSE TRUE TRUE FALSE TRUE
If only the value is the variable, it works nicely
dt[, vs == value]
# [1] FALSE FALSE TRUE TRUE FALSE TRUE
We can also call the column within the data.table scope when it's a variable
dt[, target, with = FALSE]
# vs
# 1: 0
# 2: 0
# 3: 1
# 4: 1
# 5: 0
# 6: 1
But I can't figure how to combine the two in a simple manner
Note: I'm well aware that I can simply do:
dt[[target]] == value
# [1] FALSE FALSE TRUE TRUE FALSE TRUE
But I need it within the data table scope so I could modify other columns by reference, something like
dt[, NEWCOL := sum(vs == 1), by = am]
So here are my tries when both column name and the value are variables
dt[, target == value, with = FALSE]
# Null data.table (0 rows and 0 cols)
dt[, target == value]
# [1] FALSE
dt[, (target) == value]
# [1] FALSE
dt[, .(target == value)]
# V1
# 1: FALSE
dt[, eval(target) == value]
# [1] FALSE
dt[target %in% value]
## Empty data.table (0 rows) of 11 cols: mpg,cyl,disp,hp,drat,wt...
Eventually I came up with
dt[, .SD[[target]] == value]
# [1] FALSE FALSE TRUE TRUE FALSE TRUE
but it is very inefficient, here's a simple benchmark
set.seed(123)
n <- 1e6
dt <- data.table(vs = sample(1L:30L, n, replace = TRUE), am = seq_len(n))
system.time(dt[, NEWCOL := sum(.SD[[target]] == value), by = am])
# user system elapsed
# 13.00 0.02 13.12
system.time(dt[, NEWCOL2 := sum(vs == value), by = am])
# user system elapsed
# 0.82 0.00 0.83
Question: Is there any better way of doing this that I'm missing here? Something either more idiomatic or much more efficient
Edit
Originally I was looking for something idiomatic, so I thought @GGrothendieck simple solution using get
was the one, but surprisingly all @Richard version are beating even the version that ins't doing any evaluation of the column name
set.seed(123)
n <- 1e7
dt <- data.table(vs = sample(1L:30L, n, replace = TRUE), am = seq_len(n))
cl <- substitute(
x == y,
list(x = as.name(target), y = value)
)
cl2 <- call("==", as.name(target), value)
system.time(dt[, NEWCOL := sum(vs == value), by = am])
# user system elapsed
# 0.83 0.00 0.82
system.time(dt[, NEWCOL1 := sum(.SD[[target]] == value), by = am])
# user system elapsed
# 8.97 0.00 8.97
system.time(dt[, NEWCOL2 := sum(get(target) == value), by = am])
# user system elapsed
# 2.35 0.00 2.37
system.time(dt[, NEWCOL3 := sum(eval(cl)), by = am])
# user system elapsed
# 0.69 0.02 0.71
system.time(dt[, NEWCOL4 := sum(eval(cl2)), by = am])
# user system elapsed
# 0.76 0.00 0.77
system.time(dt[, NEWCOL5 := sum(eval(as.name(target)) == value), by = am])
# user system elapsed
# 0.78 0.00 0.78