combining separate data.table calls into one by group call

Question

I'm trying to improve the efficiency of the following simple data.table syntax, so I'm trying to combine it into one call without repeatedly calling by = "group".

#data
library(data.table)
DT <- data.table(group = c(rep("a", 40), rep("b", 40)),
                 other = rnorm(80),
                 num = c(1:80))

#reduce this to one "by" call
DT[, c1 := ifelse(num <= 7, NA, num), by = "group"]
DT[, sprintf("c%d", 2:10) := shift(c1, 1:9, type = 'lag'), by = "group"]
DT[, d1 := shift(c10, 1, type = 'lag'), by = "group"]
DT[, sprintf("d%d", 2:10) := shift(d1, 1:9, type = 'lag'), by = "group"]
DT[, e1 := shift(d10, 1, type = 'lag'), by = "group"]
DT[, sprintf("e%d", 2:10) := shift(e1, 1:9, type = 'lag'), by = "group"]

Something like

DT[, .(c1 := ifelse(num <= 7, NA, num),
       sprintf("c%d", 2:10) := shift(c1, 1:9, type = 'lag'),
       d1 := shift(c10, 1, type = 'lag'),
       sprintf("d%d", 2:10) := shift(d1, 1:9, type = 'lag'),
       e1 := shift(d10, 1, type = 'lag'),
       sprintf("e%d", 2:10) := shift(e1, 1:9, type = 'lag')), by = "group"]

Edit:

This is similar but slightly different to this question as the variables created here are not independent of one another.

Any suggestions?

Thanks

As I am aware you can not use in one `data.table` call just defined variable like in dplyr — minem, Apr 08 '20 at 11:32
interesting, so are you suggesting running line by line is the most efficient way? I'd have doubts as it seems repetitive. Maybe setting a key (`setkey`) alleviates some of the issues — user63230, Apr 08 '20 at 11:54
depends on your data. Maybe setting key on `group` could yield minor improvement. — minem, Apr 08 '20 at 12:10
Does this answer your question? [Add multiple columns to R data.table in one function call?](https://stackoverflow.com/questions/11308754/add-multiple-columns-to-r-data-table-in-one-function-call) as well as https://stackoverflow.com/questions/11680579/assign-multiple-columns-using-in-data-table-by-group — Cole, Apr 08 '20 at 21:25
@minem you can absolutely do multiple updates in [tag:data.table]. There are two syntaxes: ```dt[, `:=`(z1 = x, z2 = y)]``` or ```dt[, c('z1', 'z2') := .(x, y)]``` Also for @OP, please choose minimal examples. — Cole, Apr 08 '20 at 21:27
@Cole but in your example you can not use `z1` to calculate `z2`, that is what I meant — minem, Apr 09 '20 at 05:41
@Cole as minem pointed out those questions assume that the variables are independent whereas in my question they are not. I had seen them but think this is slightly different — user63230, Apr 09 '20 at 07:20
@minem i see and agree. The ```{...}``` allows for intermediate variables and it is actually how [tag:dtplyr] translates ```mutate``` . — Cole, Apr 09 '20 at 11:32

score 2 · Accepted Answer · answered Apr 08 '20 at 22:21

Here is an option:

ix <- 2L:10L
m <- 1L:9L
DT[, c(sprintf("c%d", ix), sprintf("d%d", ix), sprintf("e%d", ix)) := {
    c1 = replace(num, num <= 7L, NA_integer_)
    lc = shift(c1, m)

    d1 = shift(lc[[9L]])
    ld = shift(d1, m)

    e1 = shift(ld[[9L]])
    c(lc, ld, shift(e1, m))
}, group]

score 1 · Answer 2 · answered Apr 08 '20 at 12:08

# You can write function:
f <- function(num) {
  c1 <- ifelse(num <= 7, NA, num)
  cl <- shift(c1, 1:9, type = 'lag')
  names(cl) <- sprintf("c%d", 2:10)
  d1 <- shift(cl[9], 1, type = 'lag')
  dl <- shift(d1, 1:9, type = 'lag')
  names(dl) <- sprintf("d%d", 2:10)
  e1 <-  shift(dl[9], 1, type = 'lag')
  el <- shift(e1, 1:9, type = 'lag')
  names(el) <- sprintf("e%d", 2:10)
  c(c1 = list(c1), cl, d1 = d1, dl, e1 = e1, el) # list of desired columns
}

x <- DT[, f(num), by = group] # apply it by group
DT <- cbind(DT, x[, -'group']) # add to initial data

Maybe this will be faster. Also, the function probably could be written better. Make sure that the function return list with your desired column names.

score 1 · Answer 3 · answered Apr 08 '20 at 22:50

You can call by once using the fact that (1) every column in the j argument of a data.table becomes a column in the return data.table, and that (2) curly braces can be used for intermediate calculations in j. Because the default value of the argument type in the shift function is lag, I did not specify it. Note that the last line in the curly braces, lst, is the only object returned.

DT[, {
  nms = paste0(rep(c("c", "d", "e"), each = 10), 1:10)
  lst = setNames(vector("list", 30), nms) 

  lst[["c1"]] = ifelse(num <= 7, NA, num)
  lst[sprintf("c%d", 2:10)] = shift(lst[["c1"]], 1:9)

  lst[["d1"]] = shift(lst[["c10"]], 1)
  lst[sprintf("d%d", 2:10)] = shift(lst[["d1"]], 1:9)

  lst[["e1"]] = shift(lst[["d10"]], 1)
  lst[sprintf("e%d", 2:10)] = shift(lst[["e1"]], 1:9)
  lst
}, by = group]

The output contains 30 columns: c1, ...,c10, d1,...,d10 and e1,...,e10

combining separate data.table calls into one by group call

3 Answers3