0

I'm trying to improve the efficiency of the following simple data.table syntax, so I'm trying to combine it into one call without repeatedly calling by = "group".

#data
library(data.table)
DT <- data.table(group = c(rep("a", 40), rep("b", 40)),
                 other = rnorm(80),
                 num = c(1:80))

#reduce this to one "by" call
DT[, c1 := ifelse(num <= 7, NA, num), by = "group"]
DT[, sprintf("c%d", 2:10) := shift(c1, 1:9, type = 'lag'), by = "group"]
DT[, d1 := shift(c10, 1, type = 'lag'), by = "group"]
DT[, sprintf("d%d", 2:10) := shift(d1, 1:9, type = 'lag'), by = "group"]
DT[, e1 := shift(d10, 1, type = 'lag'), by = "group"]
DT[, sprintf("e%d", 2:10) := shift(e1, 1:9, type = 'lag'), by = "group"]

Something like

DT[, .(c1 := ifelse(num <= 7, NA, num),
       sprintf("c%d", 2:10) := shift(c1, 1:9, type = 'lag'),
       d1 := shift(c10, 1, type = 'lag'),
       sprintf("d%d", 2:10) := shift(d1, 1:9, type = 'lag'),
       e1 := shift(d10, 1, type = 'lag'),
       sprintf("e%d", 2:10) := shift(e1, 1:9, type = 'lag')), by = "group"]

Edit:

This is similar but slightly different to this question as the variables created here are not independent of one another.

Any suggestions?

Thanks

user63230
  • 4,095
  • 21
  • 43
  • As I am aware you can not use in one `data.table` call just defined variable like in dplyr – minem Apr 08 '20 at 11:32
  • interesting, so are you suggesting running line by line is the most efficient way? I'd have doubts as it seems repetitive. Maybe setting a key (`setkey`) alleviates some of the issues – user63230 Apr 08 '20 at 11:54
  • depends on your data. Maybe setting key on `group` could yield minor improvement. – minem Apr 08 '20 at 12:10
  • 1
    Does this answer your question? [Add multiple columns to R data.table in one function call?](https://stackoverflow.com/questions/11308754/add-multiple-columns-to-r-data-table-in-one-function-call) as well as https://stackoverflow.com/questions/11680579/assign-multiple-columns-using-in-data-table-by-group – Cole Apr 08 '20 at 21:25
  • 1
    @minem you can absolutely do multiple updates in [tag:data.table]. There are two syntaxes: ```dt[, `:=`(z1 = x, z2 = y)]``` or ```dt[, c('z1', 'z2') := .(x, y)]``` Also for @OP, please choose minimal examples. – Cole Apr 08 '20 at 21:27
  • 2
    @Cole but in your example you can not use `z1` to calculate `z2`, that is what I meant – minem Apr 09 '20 at 05:41
  • @Cole as minem pointed out those questions assume that the variables are independent whereas in my question they are not. I had seen them but think this is slightly different – user63230 Apr 09 '20 at 07:20
  • @minem i see and agree. The ```{...}``` allows for intermediate variables and it is actually how [tag:dtplyr] translates ```mutate``` . – Cole Apr 09 '20 at 11:32

3 Answers3

2

Here is an option:

ix <- 2L:10L
m <- 1L:9L
DT[, c(sprintf("c%d", ix), sprintf("d%d", ix), sprintf("e%d", ix)) := {
    c1 = replace(num, num <= 7L, NA_integer_)
    lc = shift(c1, m)

    d1 = shift(lc[[9L]])
    ld = shift(d1, m)

    e1 = shift(ld[[9L]])
    c(lc, ld, shift(e1, m))
}, group]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
1
# You can write function:
f <- function(num) {
  c1 <- ifelse(num <= 7, NA, num)
  cl <- shift(c1, 1:9, type = 'lag')
  names(cl) <- sprintf("c%d", 2:10)
  d1 <- shift(cl[9], 1, type = 'lag')
  dl <- shift(d1, 1:9, type = 'lag')
  names(dl) <- sprintf("d%d", 2:10)
  e1 <-  shift(dl[9], 1, type = 'lag')
  el <- shift(e1, 1:9, type = 'lag')
  names(el) <- sprintf("e%d", 2:10)
  c(c1 = list(c1), cl, d1 = d1, dl, e1 = e1, el) # list of desired columns
}

x <- DT[, f(num), by = group] # apply it by group
DT <- cbind(DT, x[, -'group']) # add to initial data

Maybe this will be faster. Also, the function probably could be written better. Make sure that the function return list with your desired column names.

minem
  • 3,640
  • 2
  • 15
  • 29
1

You can call by once using the fact that (1) every column in the j argument of a data.table becomes a column in the return data.table, and that (2) curly braces can be used for intermediate calculations in j. Because the default value of the argument type in the shift function is lag, I did not specify it. Note that the last line in the curly braces, lst, is the only object returned.

DT[, {
  nms = paste0(rep(c("c", "d", "e"), each = 10), 1:10)
  lst = setNames(vector("list", 30), nms) 

  lst[["c1"]] = ifelse(num <= 7, NA, num)
  lst[sprintf("c%d", 2:10)] = shift(lst[["c1"]], 1:9)

  lst[["d1"]] = shift(lst[["c10"]], 1)
  lst[sprintf("d%d", 2:10)] = shift(lst[["d1"]], 1:9)

  lst[["e1"]] = shift(lst[["d10"]], 1)
  lst[sprintf("e%d", 2:10)] = shift(lst[["e1"]], 1:9)
  lst
}, by = group]

The output contains 30 columns: c1, ...,c10, d1,...,d10 and e1,...,e10