2

I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data table, and this transformation will involve a grouping operation.

Here is the setup and what I would like to achieve:

library(data.table)

set.seed(123)
n <- 1000
DT <- data.table(
        date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
        A = runif(n),
        B = rnorm(n),
        C = rexp(n))

DT[, A.prime := (A - mean(A))/sd(A), by=year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by=year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by=year(date)]

The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.

library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
        date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
        A = runif(n),
        B = rnorm(n),
        C = rexp(n))

columns <- c("A", "B", "C")

for (x in columns) {
    # This doesn't work.
    # target <- DT[, (x - mean(x, na.rm=TRUE))/sd(x, na.rm = TRUE), by=year(date)]

    # This doesn't work.
    #target <- DT[, (..x - mean(..x, na.rm=TRUE))/sd(..x, na.rm = TRUE), by=year(date)]

    # THIS WORKS! But it is tedious writing "get(x)" every time. 
    target <- DT[, (get(x) - mean(get(x), na.rm=TRUE))/sd(get(x), na.rm = TRUE), by=year(date)][, V1]

    set(DT, j = paste0(x, ".prime"), value = target)
}

Question: What is the idiomatic way to achieve the above result? There are two things which may be possibly be improved:

  1. How to avoid typing out get(x) every time I use x to access a column?
  2. Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?
Kevin
  • 740
  • 6
  • 13

2 Answers2

3

You can use .SDcols to specify the columns that you want to operate on :

library(data.table)

columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")

DT[, (newcolumns) := lapply(.SD, function(x) (x- mean(x))/sd(x)),
                      year(date), .SDcols = columns]

This avoids using get(x) everytime and updates data.table by reference.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
2

I think Ronak's answer is superior & preferable, just writing this to demonstrate a common syntax for more complicated j queries is to use a full {} expression:

target <- DT[ , by = year(date), {
  xval = eval(as.name(x))
  (xval - mean(xval, na.rm=TRUE))/sd(xval, na.rm = TRUE)
}]$V1

Two other small differences:

  • I used eval(as.name(.)) instead of get; the former is more trustworthy & IME faster
  • I replaced [ , V1] with $V1 -- the former requires the overhead of [.data.table.

You might also like to know that the base function scale will do the center & normalize steps more concisely (if slightly inefficient for being a bit to general).

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198