data.table: using colnames for assignment by reference

Question

I want to use column names for an assignment by reference (:=) within a data.table. The function I call does a calculation per row over several columns. I use the current development version of data.table (v1.9.7), which makes the argument "with = FALSE" unnecessary.

A running minimal example with explicit variable names is:

library(data.table)

DT <- data.table(a = 1:10, b = seq(2, 20, 2), c = seq(5, 50, 5))
DT[, out := sum(a, b), by = 1:nrow(DT)]

But if I have a lot of columns and I call the function with a single variable containing the (selected) column names, the code fails:

DT  <- data.table(a = 1:10, b = seq(2, 20, 2))
col <- colnames(DT)
DT[, out := sum(col), by = 1:nrow(DT)]

EDIT:

David Arenburg's answer DT[, out := Reduce(`+`, .SD), .SDcols = col] works for this specific case, but I do not really understand how this approach can be applied to other function calls. I wrote the following function to test:

myfun <- function(x, y, ...){
   in.tmp1 <- x
   in.tmp2 <- c(y, ...)
   out.tmp <- in.tmp1 + mean(in.tmp2)
   return(out.tmp)
}

Again, with the column names written out explicitly, the following approach works:

DT <- data.table(a = 1:10, b = seq(2, 20, 2), c = seq(5, 50, 5))
DT[, out := myfun(a,b,c), by = 1:nrow(DT)]

But I can't work out a more general solution for a large subset of columns in the data.table, specified by their names.

If you are doing `by = 1:nrow(DT)` you are doing it wrong. I would go with ```DT[, out := Reduce(`+`, .SD), .SDcols = col]``` — David Arenburg, Nov 23 '16 at 13:54
Thanks, this does work. But does it also work with other functions (e.g. mean) or user-written functions? I got the idea of `by = 1:nrow(DT)` from the answer to this question http://stackoverflow.com/questions/25431307/r-data-table-apply-function-to-rows-using-columns-as-arguments. For my first example, it works as it is supposed to. — moremo, Nov 23 '16 at 14:14
Well, like eddi said, it is better to vectorize. It depends on the function. The only case where you would use `by = 1:nrow(DT)` is when there is absolutely no other choice. Neither R nor `data.table` was designed to work well by row; both work best on columns/matrices. Again, it depends on your function. Also, if your data set is small, I guess it's not such a big deal to work by row. — David Arenburg, Nov 23 '16 at 14:18
I find this Q&A (and links therein) quite useful when considering row-wise operations: [How to do row wise operations on .SD columns in data.table](http://stackoverflow.com/questions/33353036/how-to-do-row-wise-operations-on-sd-columns-in-data-table) — Henrik, Nov 23 '16 at 14:21
Thank you all, but I still haven't managed to call a function with many parameters within the data.table. I think the problem is the quoting. Following this answer http://stackoverflow.com/questions/12603890/pass-column-name-in-data-table-using-variable-in-r I tried `col <- quote(c(b,c))` and `DT[, out := myfun(a,eval(col)), by = 1:nrow(DT)]`. This works in principle, but I still have to type all of the (e.g. 500) column names by hand. Suggestions, anyone? — moremo, Nov 24 '16 at 10:44
@moe ultimate solution is to build desired `j` call using computing on the language, then just `DT[, eval(j)]`. — jangorecki, Nov 25 '16 at 17:01
@jangorecki thanks for your comment. could you give me a more detailed example, plz? — moremo, Dec 13 '16 at 08:17
@moe example in http://stackoverflow.com/a/37408321/2490497 and http://stackoverflow.com/a/34970993/2490497 — jangorecki, Dec 13 '16 at 18:08
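
A minimal sketch of the approach jangorecki describes, assuming the column names are held in a character vector: build the call myfun(a, b, c) from the names, wrap it in a := call, and evaluate the result in j. The names cols, rhs and j_expr are illustrative and do not come from the thread, and evaluating a := expression through a top-level eval() in j is assumed to need data.table v1.9.7 or later.

```r
library(data.table)

DT <- data.table(a = 1:10, b = seq(2, 20, 2), c = seq(5, 50, 5))

myfun <- function(x, y, ...) {
  x + mean(c(y, ...))
}

# Column names held as strings (this vector could hold hundreds of names)
cols <- c("a", "b", "c")

# Build the unevaluated call myfun(a, b, c) from the character vector
rhs <- as.call(c(list(as.name("myfun")), lapply(cols, as.name)))

# Wrap it in an assignment by reference: `:=`(out, myfun(a, b, c))
j_expr <- substitute(out := rhs, list(rhs = rhs))

# Evaluate the constructed expression row by row, as in the question
DT[, eval(j_expr), by = seq_len(nrow(DT))]
```

Printing j_expr before evaluating it is a cheap way to check that the generated call looks like the one you would have typed by hand.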

WetRobot · Answer 1 · 2017-02-22T20:33:44.297

Consider the following:

library("data.table")

dt <- data.table(a = 1:5, b = 5:1, c = 1, d = 2, e = 5:1)


myfun <- function(x, y, ...){
  in.tmp1 <- x
  in.tmp2 <- c(y, ...)
  out.tmp <- in.tmp1 + mean(in.tmp2)
  return(out.tmp)
}

my_vars <- c("a", "c", "d")

var_list <- mget(my_vars, envir = as.environment(dt))

names(var_list)[1:2] <- c("x", "y")

dt[, "out" := do.call(myfun, var_list)]

Here we collect an arbitrary set of columns, named in my_vars, into var_list, a list of non-copied aliases for the corresponding columns of dt. R can pass the columns as arguments of a function using do.call, but the names of the elements in the argument list (here var_list) must match the names of the function's arguments (myfun has the arguments "x", "y" and "...", and the last of these accepts elements of any name).
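
The name matching described above can be seen without data.table at all. This is a small sketch in base R with made-up argument values; named_args, res_named, res_unnamed and res_bad are illustrative names only.

```r
myfun <- function(x, y, ...) {
  x + mean(c(y, ...))
}

# "x" and "y" are matched to the formals by name; "d" falls into "..."
named_args <- list(y = 1, d = 2, x = 10)
res_named <- do.call(myfun, named_args)       # same as myfun(x = 10, y = 1, d = 2)

# With no names at all, the elements are matched by position
res_unnamed <- do.call(myfun, list(10, 1, 2)) # same as myfun(10, 1, 2)

# Names that match neither "x" nor "y" all end up in "...",
# so x is never supplied and the call fails
res_bad <- try(do.call(myfun, list(a = 10, c = 1, d = 2)), silent = TRUE)
inherits(res_bad, "try-error")
```

This is why the columns are renamed to "x" and "y" (or stripped of names entirely) before they are handed to do.call.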

If you want to make more use of data.table and not use mget, try

## so myfun finds the correct columns for args "x" and "y"
setnames(dt, c("a", "c"), c("x", "y"))

my_vars <- c("x", "y", "d")
dt[, "out" := do.call(myfun, .SD), .SDcols = my_vars]

EDIT 2017-02-22: passing an unnamed list of columns to do.call also works; the columns are then matched to the arguments by position.

dt[, "out" := do.call(myfun, unname(as.list(.SD))), .SDcols = my_vars]
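
One caveat worth checking against your own function: without by, every argument is a whole column, so the mean() inside myfun pools all values of the trailing columns into a single number. With the sample data above that goes unnoticed because c and d are constant. The sketch below uses different, made-up data (column e is non-constant) to compare the pooled call, the per-row call from the question, and a vectorized equivalent using rowMeans; the column names pooled, by_row and vectorized are illustrative.

```r
library(data.table)

dt <- data.table(a = 1:5, b = 5:1, e = c(10, 20, 30, 40, 50))

myfun <- function(x, y, ...) {
  x + mean(c(y, ...))
}

my_vars <- c("a", "b", "e")

# Whole-column call: mean() sees every value of b and e at once
dt[, pooled := do.call(myfun, unname(as.list(.SD))), .SDcols = my_vars]

# Per-row call: .SD holds a single row in each group
dt[, by_row := do.call(myfun, unname(as.list(.SD))),
   by = seq_len(nrow(dt)), .SDcols = my_vars]

# Vectorized equivalent of the per-row result, with no grouping
dt[, vectorized := a + rowMeans(.SD), .SDcols = my_vars[-1]]

dt
```

If the per-row result is what you want, the rowMeans form avoids the cost of one group per row; whether such a vectorized rewrite exists depends on the function.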

Thanks a lot! Your solution definitely works. However, it does not seem to me to be the intended way of using `data.table`. But I might be wrong. — moremo, Dec 13 '16 at 08:24
@moe I'm not certain what you mean by intended approach, but for more "data.table-like" code you can try the code block I added in my post. Since `do.call` tries to match columns to function arguments by the names of the columns, an alternative solution is to not have any names at all in the list passed on to `do.call`. This assumes the columns are in the correct order, however. — WetRobot, Feb 22 '17 at 20:36
