R data.table: efficiently access and update a variable column name in j expression with grouping

Question

I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data table, and this transformation will involve a grouping operation.

Here is the setup and what I would like to achieve:

library(data.table)

set.seed(123)
n <- 1000
DT <- data.table(
        date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
        A = runif(n),
        B = rnorm(n),
        C = rexp(n))

DT[, A.prime := (A - mean(A))/sd(A), by=year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by=year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by=year(date)]

The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.

library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
        date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
        A = runif(n),
        B = rnorm(n),
        C = rexp(n))

columns <- c("A", "B", "C")

for (x in columns) {
    # This doesn't work.
    # target <- DT[, (x - mean(x, na.rm=TRUE))/sd(x, na.rm = TRUE), by=year(date)]

    # This doesn't work.
    #target <- DT[, (..x - mean(..x, na.rm=TRUE))/sd(..x, na.rm = TRUE), by=year(date)]

    # THIS WORKS! But it is tedious writing "get(x)" every time. 
    target <- DT[, (get(x) - mean(get(x), na.rm=TRUE))/sd(get(x), na.rm = TRUE), by=year(date)][, V1]

    set(DT, j = paste0(x, ".prime"), value = target)
}

Question: What is the idiomatic way to achieve the above result? There are two things which may be possibly be improved:

How to avoid typing out get(x) every time I use x to access a column?
Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?

score 3 · Accepted Answer · answered Jul 23 '20 at 05:32

3

You can use .SDcols to specify the columns that you want to operate on :

library(data.table)

columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")

DT[, (newcolumns) := lapply(.SD, function(x) (x- mean(x))/sd(x)),
                      year(date), .SDcols = columns]

This avoids using get(x) everytime and updates data.table by reference.

answered Jul 23 '20 at 05:32

Ronak Shah

377,200
20
156
213

Exactly what I wanted, thanks a lot! – Kevin Jul 23 '20 at 17:10

score 2 · Answer 2 · answered Jul 23 '20 at 08:38

I think Ronak's answer is superior & preferable, just writing this to demonstrate a common syntax for more complicated j queries is to use a full {} expression:

target <- DT[ , by = year(date), {
  xval = eval(as.name(x))
  (xval - mean(xval, na.rm=TRUE))/sd(xval, na.rm = TRUE)
}]$V1

Two other small differences:

I used eval(as.name(.)) instead of get; the former is more trustworthy & IME faster
I replaced [ , V1] with $V1 -- the former requires the overhead of [.data.table.

You might also like to know that the base function scale will do the center & normalize steps more concisely (if slightly inefficient for being a bit to general).

Thanks for this answer Michael, learned a lot! – Kevin Jul 23 '20 at 17:09 — Kevin, Jul 23 '20 at 17:09

R data.table: efficiently access and update a variable column name in j expression with grouping

2 Answers2