9

Consider

target <- "vs"
value <- 1

library(data.table)
dt <- as.data.table(head(mtcars))

So I'm trying to pass both column name and a value as variables into the j expression within data.table environment, something that would be equivalent to

dt[, vs == 1]
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

If only the value is the variable, it works nicely

dt[, vs == value]
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

We can also call the column within the data.table scope when it's a variable

dt[, target, with = FALSE]
#    vs
# 1:  0
# 2:  0
# 3:  1
# 4:  1
# 5:  0
# 6:  1

But I can't figure how to combine the two in a simple manner

Note: I'm well aware that I can simply do:

dt[[target]] == value
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

But I need it within the data table scope so I could modify other columns by reference, something like

dt[, NEWCOL := sum(vs == 1), by = am]

So here are my tries when both column name and the value are variables

dt[, target == value, with = FALSE]
# Null data.table (0 rows and 0 cols)
dt[, target == value]
# [1] FALSE
dt[, (target) == value]
# [1] FALSE
dt[, .(target == value)]
# V1
# 1: FALSE
dt[, eval(target) == value]
# [1] FALSE
dt[target %in% value]
## Empty data.table (0 rows) of 11 cols: mpg,cyl,disp,hp,drat,wt...

Eventually I came up with

dt[, .SD[[target]] == value]
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

but it is very inefficient, here's a simple benchmark

set.seed(123)
n <- 1e6
dt <- data.table(vs = sample(1L:30L, n, replace = TRUE), am = seq_len(n))
system.time(dt[, NEWCOL := sum(.SD[[target]] == value), by = am])
#  user  system elapsed 
# 13.00    0.02   13.12 
system.time(dt[, NEWCOL2 := sum(vs == value), by = am])
# user  system elapsed 
# 0.82    0.00    0.83 

Question: Is there any better way of doing this that I'm missing here? Something either more idiomatic or much more efficient


Edit

Originally I was looking for something idiomatic, so I thought @GGrothendieck simple solution using get was the one, but surprisingly all @Richard version are beating even the version that ins't doing any evaluation of the column name

set.seed(123)
n <- 1e7
dt <- data.table(vs = sample(1L:30L, n, replace = TRUE), am = seq_len(n))

cl <- substitute(
  x == y, 
  list(x = as.name(target), y = value)
)
cl2 <- call("==", as.name(target), value)

system.time(dt[, NEWCOL := sum(vs == value), by = am])
#   user  system elapsed 
#   0.83    0.00    0.82 
system.time(dt[, NEWCOL1 := sum(.SD[[target]] == value), by = am])
#   user  system elapsed 
#   8.97    0.00    8.97 
system.time(dt[, NEWCOL2 := sum(get(target) == value), by = am])
#   user  system elapsed 
#   2.35    0.00    2.37 
system.time(dt[, NEWCOL3 := sum(eval(cl)), by = am])
#   user  system elapsed 
#   0.69    0.02    0.71 
system.time(dt[, NEWCOL4 := sum(eval(cl2)), by = am])
#   user  system elapsed 
#   0.76    0.00    0.77 
system.time(dt[, NEWCOL5 := sum(eval(as.name(target)) == value), by = am])
#   user  system elapsed 
#   0.78    0.00    0.78 
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • These all look really strange to me. (@jangorecki's bounty brought me here.) Why use `by` with a row counter? That can't possibly be optimal. And why `sum` a 0/1 scalar? The same vector can be produced 700x faster (on my computer) with `dt[,mycol:=0L];dt[get(target)==value,mycol:=1L]`. Check with `dt[,table(mycol,NEWCOL5)]` – Frank May 26 '15 at 17:30
  • Besides I've learn interesting substitution from Richard's answer I still cannot deal with similar *programming update by reference* on character vector input. Something like: a) `select <- c("value"); DT[JN, c("value") := list(i.value)]` and b) `select <- c("value","meta"); DT[JN, c("value","meta") := list(i.value,i.meta)`. Tried with `lapply(select, as.name(paste0("i.",select)))` but `name`s nested in list seems to not be catched. I may start a new question for that. – jangorecki May 26 '15 at 20:19
  • [here it is](http://stackoverflow.com/questions/30468455/dynamically-build-call-for-lookup-multiple-columns). – jangorecki May 26 '15 at 20:40

1 Answers1

9

Here is one possible alternative.

target <- "vs"
value <- 1
dt <- as.data.table(head(mtcars))

In terms of code it's not necessarily simpler, but we can set up an unevaluated call cl defined outside the scope of dt which is to be evaluated inside the data table's environment.

cl <- substitute(
    x == y, 
    list(x = as.name(target), y = value)
)

substitute() might be necessary for longer expressions. But in this case, call() would shorten the code and create the same cl result. And so cl could also be

cl <- call("==", as.name(target), value)

Now we can evaluate cl inside dt. On your example this seems to work fine.

dt[, NEWCOL := sum(eval(cl)), by = am][]
#     mpg cyl disp  hp drat    wt  qsec vs am gear carb NEWCOL
# 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4      1
# 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4      1
# 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1      1
# 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1      2
# 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2      2
# 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1      2

After thinking about this for a minute, I'm not sure value needed to be substituted, and hence the following also works. But as David notes, the first approach is more time efficient.

dt[, eval(as.name(target)) == value]
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • I found this answer very valuable, just for preparing calls for flexible usage of `data.table`. Unfortunately I need to wait 23h before I can give you a bounty. BTW. Similar approach instead of `j` arg in `i` argument: [gist](https://gist.github.com/jangorecki/06a05f0efab256765f53). – jangorecki May 25 '15 at 16:40
  • Can you check last comment under the question? maybe it would be simple for you... – jangorecki May 26 '15 at 20:41