What is the preferred way to programmatically define multiple new data.table columns?

Question

The FAQ states that the preferred way to add a new column to a data.table when programming is to use quote() and then eval(). But what if I want to add several columns at once? Playing around with this I came up with the following solution:

library(data.table)
DT <- data.table(V1=1:1000,
                 V2=2001:3000)
col.names <- c("V3","V4")
col.specs <- vector("list",2)
col.specs[[1]] <- quote(V1**2)
col.specs[[2]] <- quote((V1+V2)/2)

DT[,c(col.names) := lapply(col.specs,eval,envir=DT)]

which yields the desired result:

> head(DT)
   V1   V2 V3   V4
1:  1 2001  1 1001
2:  2 2002  4 1002
3:  3 2003  9 1003
4:  4 2004 16 1004
5:  5 2005 25 1005
6:  6 2006 36 1006

My question is simply: is this the preferred method? Specifically, can someone think of a way to avoid specifying the environment in the lapply() call? If I leave it out I get:

> DT[,c(col.names) := lapply(col.specs,eval)]
Error in eval(expr, envir, enclos) : object 'V1' not found

It may be no big deal, but at least to me it feels a bit suspicious that the data table does not recognise its own columns. Also, if I add the columns one by one, there is no need to specify the environment:

> DT <- data.table(V1=1:1000,
+                  V2=2001:3000)
> col.names <- c("V3","V4")
> col.specs <- vector("list",2)
> col.specs[[1]] <- quote(V1**2)
> col.specs[[2]] <- quote((V1+V2)/2)
> for (i in 1L:length(col.names)) {
+   DT[,col.names[i] := list(eval(col.specs[[i]]))]
+ }
> head(DT)
   V1   V2 V3   V4
1:  1 2001  1 1001
2:  2 2002  4 1002
3:  3 2003  9 1003
4:  4 2004 16 1004
5:  5 2005 25 1005
6:  6 2006 36 1006
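
The difference between the two attempts comes down to where `eval()` looks by default: its `envir` argument defaults to `parent.frame()`, the frame `eval()` was called from. Called directly inside `j`, that frame can see the columns; called through `lapply()`, it is `lapply()`'s own frame, which cannot. A minimal base R sketch of the same effect follows, using a hypothetical function `demo()` whose local variables stand in for the columns (it illustrates the scoping rule only, not data.table internals):

```r
# Hypothetical stand-in for the data.table scope: V1 and V2 are local
# variables of demo(), playing the role of the columns.
demo <- function(specs) {
  V1 <- 1:3
  V2 <- 4:6
  here <- environment()  # the frame that holds V1 and V2

  list(
    # eval() called directly: parent.frame() is demo()'s frame, so V1 is found
    direct = eval(specs[[1]]),
    # eval() called by lapply(): parent.frame() is lapply()'s frame, so V1 is not found
    via_lapply = tryCatch(lapply(specs, eval),
                          error = function(e) conditionMessage(e)),
    # passing the environment explicitly restores the lookup
    explicit = lapply(specs, eval, envir = here)
  )
}

res <- demo(list(quote(V1^2), quote((V1 + V2) / 2)))
```

Here `res$via_lapply` holds the error message instead of a result, mirroring the `object 'V1' not found` error above, while `res$direct` and `res$explicit` hold the computed values.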

That looks good to me. Specifying `envir` also seems to be good practice, since it avoids unexpected results (say `DT` doesn't have `V1` but the global environment does: would you still want it to work?). — David Arenburg, Feb 21 '17 at 13:09
@DavidArenburg that is a good point. I guess I am struggling a bit because I rarely use `eval`. It does seem to recognise _functions_ that are in the global environment. If I write a new function `myfunc <- function(x) return(x**2)` and use that function instead, i.e. `col.specs[[1]] <- quote(myfunc(V1))`, then the code still executes, even though `myfunc` is not (I guess?) in DT's environment. — Ola Caster, Feb 21 '17 at 13:29
Also I think it is a bit counter-intuitive that the same does not apply when columns are added one at a time. (Edited post to reflect this.) — Ola Caster, Feb 21 '17 at 14:18
Just to have fewer objects floating around, I'd name col.specs' elements after col.names and delete the latter. And if it's easy to manage, I'd try combining them into a single quoted expression instead of using a list. Btw, regarding environments, R will always search the immediate environment first, then the parent, then the parent's parent, and so on. So if it exists in the global env, it will work (is my understanding). — Frank, Feb 21 '17 at 15:19
@Frank thanks for your comment. The variable names were not given much thought; the aim was just to make the question clear enough. I agree that using a single quoted expression would be preferable, but I didn't manage to find a way to combine them. (In typical use, of course, the specific set of such expressions would change from time to time and would not be known explicitly as in this simple example.) — Ola Caster, Feb 21 '17 at 15:27
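
On the point about `myfunc` raised in the comments: when `envir` is a list or data frame, `eval()` looks names up there first and then falls back to its `enclos` argument, which defaults to the calling frame. That fallback is how a function defined outside the table can still be found. A small base R sketch, using a plain list as a hypothetical stand-in for `DT`:

```r
myfunc <- function(x) x^2          # lives outside the "table"
cols <- list(V1 = 1:3, V2 = 4:6)   # hypothetical stand-in for DT's columns

# V1 is found in `cols`; myfunc is not there, so eval() falls back to `enclos`
found <- eval(quote(myfunc(V1)), envir = cols)

# With an enclosure that cannot see myfunc, the same call fails
not_found <- tryCatch(
  eval(quote(myfunc(V1)), envir = cols, enclos = baseenv()),
  error = function(e) conditionMessage(e)
)
```

The second call is only there to show that the lookup really goes through `enclos`; in normal use the default is what you want.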

score 1 · Accepted Answer · edited May 23 '17 at 12:24

Since things are easier with a single quoted expression...

library(data.table)
DT <- data.table(V1=1:1000, V2=2001:3000)

new_cols = list(
  V3 = quote(V1**2),
  v4 = quote((V1+V2)/2)
)

e = as.call(c(quote(`:=`), new_cols))
DT[, eval(e)]

Then you can freely add to or edit new_cols with the names in close proximity to the exprs.
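
If this gets reused, the construction can be wrapped in a small helper. The sketch below assumes a hypothetical function name `add_cols()` taking a named list of quoted expressions; it is one way to package the idea above, not part of the data.table API.

```r
library(data.table)

# Hypothetical helper: add columns described by a named list of quoted expressions.
add_cols <- function(dt, exprs) {
  stopifnot(is.data.table(dt), is.list(exprs),
            !is.null(names(exprs)), all(nzchar(names(exprs))))
  e <- as.call(c(quote(`:=`), exprs))  # builds `:=`(name1 = expr1, name2 = expr2, ...)
  dt[, eval(e)]                        # adds the columns by reference
  invisible(dt)
}

DT <- data.table(V1 = 1:1000, V2 = 2001:3000)
add_cols(DT, list(V3 = quote(V1^2), V4 = quote((V1 + V2) / 2)))
head(DT, 3)
```

Because `:=` works by reference, `DT` itself gains the new columns; the invisible return value is only a convenience for chaining.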

Sources: Arun, and me citing him before.

Side note. The syntax above is

`:=`(col = v, col2 = v2, ...)

But we should also be able to do

c("col", "col2") := list(v, v2)
# aka
`:=`(c("col", "col2"), list(v, v2))

However, I can't figure out how to build that form programmatically:

DT <- data.table(V1=1:1000, V2=2001:3000)
e2 = as.expression(list(quote(`:=`), names(new_cols), unname(new_cols)))
# gives an error:
DT[, eval(e2)]

# even though it works when written directly:
DT[, `:=`(c("V3", "v4"), list(V1^2, (V1 + V2)/2))]

I'd like to know how to do that, though...
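
One possible way to build the two-argument form, offered as a sketch rather than a confirmed answer: `as.expression()` on a list gives an expression vector whose first element is just the symbol `:=`, whereas `as.call()` gives a single call. The right-hand side also seems to need to be a call to `list()` wrapping the quoted expressions, not a list object holding them. `new_cols` is redefined here (with `V4` capitalised) so the block stands alone.

```r
library(data.table)

DT <- data.table(V1 = 1:1000, V2 = 2001:3000)
new_cols <- list(V3 = quote(V1^2), V4 = quote((V1 + V2) / 2))

# Right-hand side as a call: list(V1^2, (V1 + V2)/2)
rhs <- as.call(c(quote(list), unname(new_cols)))

# Full call: `:=`(c("V3", "V4"), list(V1^2, (V1 + V2)/2))
e2 <- as.call(list(quote(`:=`), names(new_cols), rhs))

DT[, eval(e2)]
```

The named form built with `as.call(c(quote(`:=`), new_cols))` is shorter, so this variant is mainly useful when the names and expressions already live in separate objects.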
