6

I have the following data table:

library(data.table)
dat = data.table(j = c(3,8,9,11,10,28), gr = c(9,9,9,9,10,10))
> dat
    j gr
1:  3  9
2:  8  9
3:  9  9
4: 11  9
5: 10 10
6: 28 10

There are two groups (specified by 'gr') and they are ordered. Now what I would like to achieve is to create for each row of each group a cumulative vector of values in 'j'. The result should be a list column, like this:

res_dat = data.table(j = c(3,8,9,11,10,28), gr = c(9,9,9,9,10,10),
                     res = list(3, c(3,8), c(3,8,9), c(3,8,9,11),
                                10, c(10, 28)))
> res_dat
    j gr         res
1:  3  9           3
2:  8  9         3,8
3:  9  9       3,8,9
4: 11  9  3, 8, 9,11
5: 10 10          10
6: 28 10       10,28

I tried the following:

First I created a dummy column with the number per row for each group.

dat[, tmp:= seq_len(.N), by = gr]

My plan was to use that number in order to subset the j vector, but I have not managed. None of these worked:

dat[, res := list(j[1:tmp]), by = gr]
dat[, res := list(list(j[1:tmp])), by = gr] # based on https://stackoverflow.com/questions/22531477/using-lists-inside-data-table-columns

I get the following error:

Warning messages:
1: In 1:tmp : numerical expression has 4 elements: only the first used
2: In 1:tmp : numerical expression has 2 elements: only the first used

which does help understand how it fails but I don't know how I can make it succeed. Any ideas?

Henrik
  • 65,555
  • 14
  • 143
  • 159
User2321
  • 2,952
  • 23
  • 46
  • 4
    One possibility: `dat[, res := .(sapply(seq_len(.N), function(x) j[seq_len(x)])), by = gr]` – Henrik Aug 15 '21 at 14:37
  • While `Reduce(c, j, accumulate=TRUE)` works on the console, it does not inside the `dat` environment, frustrating. @Henrik's comment is (I believe) the simplest, though I suggest the use of `lapply` or `sapply(., simplify=FALSE)` to preclude corner-case differences. – r2evans Aug 15 '21 at 15:15
  • 2
    @r2evans Wrap it in `list`: `dat[, res2 := .(Reduce(c, j, accumulate=TRUE)), by = gr]` – Henrik Aug 15 '21 at 15:20
  • 1
    Dang it, I thought I had tried that. Thanks @Henrik ... might as well answer with that! It's elegant and `data-table`-canonical. (It seems odd to me that it was taking the list return value and discarding most of it in assignment. The return value from `Reduce(..., accum=T)` is already a `list`, not sure why a double-list is strictly necessary here.) – r2evans Aug 15 '21 at 15:22

2 Answers2

7

This is Henrik's answer (and if they come back, I'll happy give this answer to them ... somehow):

dat[, res := .(Reduce(c, j, accumulate=TRUE)), by = gr]
#        j    gr         res
#    <num> <num>      <list>
# 1:     3     9           3
# 2:     8     9         3,8
# 3:     9     9       3,8,9
# 4:    11     9  3, 8, 9,11
# 5:    10    10          10
# 6:    28    10       10,28

Reduce is similar to sapply except that it operates on the current value and results of the previous operation. For instance, we can see

sapply(1:3, function(z) z*2)
# [1] 2 4 6

This, unrolled, equates to

1*2 # 2
2*2 # 4
3*2 # 6

That is, the calculation on one element of the vector/list is completely independent, never knowing the results from previous iterations.

However, Reduce is explicitly given the results of the previous calculation. By default, it will only return the last calculation, which would be analogous to tail(sapply(...), 1):

Reduce(function(prev, this) prev + this*2, 11:13)
# [1] 61

That seems a bit obscure ... let's look at all of the interim steps, where the answer above is the last:

Reduce(function(prev, this) prev + this*2, 11:13, accumulate = TRUE)
# [1] 11 35 61

In this case (without specifying init=, wait for it), the first result is just the first value in x=, not run through the function. If we unroll this, we'll see

11        # 11 is the first value in x
   _________/
  /
 v
11 + 12*2 # 35
35 + 13*2 # 61

Sometimes we need the first value in x= to be run through the function, with a starting condition (a first-time value for prev when we don't have a previous iteration to use). For that, we can use init=; we can think of the use of init= by looking at two perfectly-equivalent calls:

Reduce(function(prev, this) prev + this*2, 11:13, accumulate = TRUE)
Reduce(function(prev, this) prev + this*2, 12:13, init = 11, accumulate = TRUE)
# [1] 11 35 61

(Without init=, Reduce will take the first element of x= and assign it to init= and remove it from x=.)

Now let's say we want the starting condition (injected "previous" value) to be 0, then we would do

Reduce(function(prev, this) prev + this*2, 11:13, init = 0, accumulate = TRUE)
# [1]  0 22 46 72


### unrolled
 0        # 0 is the init= value
   ________/
  /
 v
 0 + 11*2 # 22
22 + 12*2 # 46
46 + 13*2 # 72

Let's bring that back to this question and this data. I'll inject a browser() and change the function a little so that we can look at all intermediate values.

> dat[, res := .(Reduce(function(prev, this) { browser(); c(prev, this); }, j, accumulate=TRUE)), by = gr]
Called from: f(init, x[[i]])
Browse[1]> debug at #1: c(prev, this)
Browse[2]> prev                                    # group `gr=9`, row 2
[1] 3
Browse[2]> this
[1] 8
Browse[2]> c(prev, this)
[1] 3 8
Browse[2]> c                                       # 'c'ontinue

Browse[2]> Called from: f(init, x[[i]])
Browse[1]> debug at #1: c(prev, this)
Browse[2]> prev                                    # group `gr=9`, row 3
[1] 3 8
Browse[2]> this
[1] 9
Browse[2]> c(prev, this)
[1] 3 8 9
Browse[2]> c                                       # 'c'ontinue

Browse[2]> Called from: f(init, x[[i]])
Browse[1]> debug at #1: c(prev, this)
Browse[2]> prev                                    # group `gr=9`, row 4
[1] 3 8 9
Browse[2]> this
[1] 11
Browse[2]> c(prev, this)
[1]  3  8  9 11
Browse[2]> c                                       # 'c'ontinue

Browse[2]> Called from: f(init, x[[i]])
Browse[1]> debug at #1: c(prev, this)
Browse[2]> prev                                    # group `gr=10`, row 6
[1] 10
Browse[2]> this
[1] 28
Browse[2]> c(prev, this)
[1] 10 28
Browse[2]> c                                       # 'c'ontinue

Notice how we didn't "see" rows 1 or 5, since they were the init= conditions for the reduction (the first prev value seen in each group).

Reduce can be a difficult function to visualize and work with. When I use it, I almost always pre-insert browser() into the anon-function and walk through the first three steps: the first to ensure the init= is correct, the second to make sure the anon-function is doing what I think I want with the init and next value, and the third to make sure that it continues properly. This is similar to "Proof by Deduction": the nth calc will be correct because we know the (n-1)th calc is correct.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 2
    Glad you posted @r2evans! I was busy with my coffee. Cheers – Henrik Aug 15 '21 at 15:50
  • 2
    *highfive* for gender neutral language. – Martin Gal Aug 15 '21 at 15:52
  • This is perfect thank you both very much! May I ask that just for posterity you add a brief explanation for the code? – User2321 Aug 16 '21 at 04:34
  • 1
    Does that help explain it? – r2evans Aug 16 '21 at 12:10
  • 1
    That's much more than I had in mind! Thank you – User2321 Aug 19 '21 at 11:13
  • 1
    Glad it helped. I believe `sapply` and `lapply` are relatively straight-forward for most; `apply` is only slightly more to grasp. However, `mapply`/`Map` can be a bit more difficult for some, and `Reduce` can be easily misunderstood. Walking through like this seems to be (imho) a good way to visualize what it is doing under the hood. – r2evans Aug 19 '21 at 11:24
3

Here is a simple tidyverse solution, unfortunately I haven't started learning data.table but until you get a relevant solution it will do fine:

library(dplyr)
library(purrr)

dat %>%
  group_by(gr) %>%
  mutate(res = accumulate(j[-1], .init = j[1], ~ c(.x, .y)))

# A tibble: 6 x 3
# Groups:   gr [2]
      j    gr res      
  <dbl> <dbl> <list>   
1     3     9 <dbl [1]>
2     8     9 <dbl [2]>
3     9     9 <dbl [3]>
4    11     9 <dbl [4]>
5    10    10 <dbl [1]>
6    28    10 <dbl [2]>

Or in base R we could do:

do.call(rbind, lapply(unique(dat$gr), function(a) {
  tmp <- subset(dat, gr == a)
  tmp$res <- Reduce(c, tmp$j, accumulate = TRUE)
  tmp
}))

Or even this terrific yet concise suggestion by dear @Henrik:

do.call(rbind, by(dat, dat$gr, function(d){
  cbind(d, res = I(Reduce(c, d$j, accumulate=TRUE)))}))

Or even this one also by dear @Henrik:

dat$res = ave(dat$j, dat$gr, FUN = function(x) Reduce(c, x, accumulate=TRUE))
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41