This answer suggested using Reduce() and shift() for rolling window problems with data.table. This benchmark showed that the approach might be considerably faster than zoo::rollapply().
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
#    return sec momentum
# 1:    0.1   A       NA
# 2:    0.1   A       NA
# 3:    0.1   A    0.331
# 4:    0.1   A    0.331
# 5:    0.1   A    0.331
# 6:    0.2   B       NA
# 7:    0.2   B       NA
# 8:    0.2   B    0.728
# 9:    0.2   B    0.728
#10:    0.2   B    0.728
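The expression works because shift(return + 1.0, 0:2, type="lag") returns a list of three vectors, i.e., the gross returns plus their first and second lags (NA-padded), and Reduce(`*`, ...) multiplies these vectors elementwise. So every row ends up with the product of its own and the two preceding gross returns. A minimal sketch on a plain vector (the return values are only for illustration):
library(data.table)
x <- c(0.10, 0.11, 0.12, 0.13, 0.14)   # illustrative returns
# shift() with n = 0:2 returns a list: x + 1 unshifted, lagged by 1, and lagged by 2
shift(x + 1, 0:2, type = "lag")
# elementwise product of the three list elements, minus 1
Reduce(`*`, shift(x + 1, 0:2, type = "lag")) - 1
# NA NA 0.367520 0.404816 0.442784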
Benchmark (10 rows, OP data set)
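zoo_fun() below is the rollapply() helper from the accepted answer and is not repeated here; judging from the calls and results below, it presumably looks roughly like this sketch:
library(zoo)
# assumed definition, reconstructed to match the calls below (not the accepted
# answer's exact code): rolling product of gross returns over a right-aligned
# window of width n, NA-filled at the start
zoo_fun <- function(x, n) {
  rollapplyr(x + 1, n, FUN = prod, fill = NA) - 1
}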
microbenchmark::microbenchmark(
zoo = test[, momentum := zoo_fun(return, 3), by = sec][],
red = test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][],
times = 100L
)
#Unit: microseconds
# expr      min       lq      mean   median        uq      max neval cld
#  zoo 2318.209 2389.131 2445.1707 2421.541 2466.1930 3108.382   100   b
#  red  562.465  625.413  663.4893  646.880  673.4715 1094.771   100   a
Benchmark (100k rows)
To verify the benchmark results with the small data set, a larger data set is constructed:
n_rows <- 1e4
test0 <- data.table(return = rep(as.vector(outer(1:5/100, 1:2/10, "+")), n_rows),
sec = rep(rep(c("A", "B"), each = 5L), n_rows))
test0
#        return sec
#     1:   0.11   A
#     2:   0.12   A
#     3:   0.13   A
#     4:   0.14   A
#     5:   0.15   A
#    ---
# 99996:   0.21   B
# 99997:   0.22   B
# 99998:   0.23   B
# 99999:   0.24   B
#100000:   0.25   B
As test is modified in place, each benchmark run is started with a fresh copy of test0.
microbenchmark::microbenchmark(
copy = test <- copy(test0),
zoo = {
test <- copy(test0)
test[, momentum := zoo_fun(return, 3), by = sec][]
},
red = {
test <- copy(test0)
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
},
times = 10L
)
#Unit: microseconds
# expr         min          lq         mean      median          uq         max neval cld
# copy     282.619     294.512     325.3261     298.424     350.272     414.983    10   a
#  zoo 1129601.974 1144346.463 1188484.0653 1162598.499 1194430.395 1337727.279    10   b
#  red    3354.554    3439.095    6135.8794    5002.008    7695.948   11443.595    10   a
For 100k rows, the Reduce() / shift() approach is more than 200 times faster than zoo::rollapply() (median 5.0 ms versus 1.16 s, both timings including the copy of the data).
Apparently, there are different interpretations of what the expected result is.
To investigate this, a modified data set is used:
test <- data.table(return=c(0.1, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
sec=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
test
#    return sec
# 1:   0.10   A
# 2:   0.11   A
# 3:   0.12   A
# 4:   0.13   A
# 5:   0.14   A
# 6:   0.21   B
# 7:   0.22   B
# 8:   0.23   B
# 9:   0.24   B
#10:   0.25   B
Note that the return values within each group vary, which is different from the OP's data set where the return values are constant within each sec group.
With this, the accepted answer (rollapply()) returns:
test[, momentum := zoo_fun(return, 3), by = sec][]
#    return sec momentum
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.367520
# 4:   0.13   A 0.404816
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.815726
# 9:   0.24   B 0.860744
#10:   0.25   B 0.906500
Henrik's answer computes the product only over the last three rows of each group and assigns that single value to those rows. It returns:
test[test[ , tail(.I, 3), by = sec]$V1, res := prod(return + 1) - 1, by = sec][]
#    return sec      res
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.442784
# 4:   0.13   A 0.442784
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.906500
# 9:   0.24   B 0.906500
#10:   0.25   B 0.906500
The Reduce()/shift() solution returns the same result as the accepted rollapply() answer:
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
#    return sec momentum
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.367520
# 4:   0.13   A 0.404816
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.815726
# 9:   0.24   B 0.860744
#10:   0.25   B 0.906500
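To double-check that the Reduce()/shift() and rollapply() results agree numerically (apart from floating point noise due to the different order of multiplication), both can be stored side by side and compared. The column names mom_zoo and mom_red are only used for this illustration:
test[, mom_zoo := zoo_fun(return, 3), by = sec]
test[, mom_red := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec]
# compare both columns with a tolerance for floating point differences
test[, all.equal(mom_zoo, mom_red)]
# expected result: TRUE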