
I am trying to solve the problem proposed here. Basically, what I need is, for each row of the data.table, to take the values of each variable in the previous row and use them to compute the values of the variables in the current row.

I have tried using data.table, but the result is quite bulky and, I believe, extremely inefficient (especially for a large number of rows). I also tried using the shift() function, but could not fit it into my temporary solution.

Here is a toy example:

library(data.table)
DT = data.table(a = numeric(10L), b = numeric(10L), c = numeric(10L), iter = 1:10)

for(i in DT[,.I]){

  DT[i, c('a','b','c') := {
    if(iter == 1) {
      a = 1
      b = 2
      c = 3
    } else { # if it is not the first iteration
      a = DT[i-1, a + b] # read the values from the previous row to compute the new values
      b = DT[i-1, b] - a
      c = a / b + DT[i-1, c]
    }

    .(a, b, c)
  }]

}

and here's the output:

     a  b          c iter
 1:  1  2  3.0000000    1
 2:  3 -1  0.0000000    2
 3:  2 -3 -0.6666667    3
 4: -1 -2 -0.1666667    4
 5: -3  1 -3.1666667    5
 6: -2  3 -3.8333333    6
 7:  1  2 -3.3333333    7
 8:  3 -1 -6.3333333    8
 9:  2 -3 -7.0000000    9
10: -1 -2 -6.5000000   10

Can someone help me improve the code?

Jinglestar
  • Is there a specific reason which requires to use `data.table` instead of rowwise matrix computations? – Uwe May 28 '19 at 16:00
  • @Uwe absolutely none, I just assumed data.table could be the most computationally efficient way to solve this problem. – Jinglestar May 28 '19 at 16:24

4 Answers


Note: This is not a general answer to the OP's problem, just to the toy example posted.

Your iterations for a and b cycle with a period of six, and c is a cumulative sum. As a result, the sequence does not have to be computed iteratively; it has a closed-form solution for any iteration number:

f = function(i, a0 = 1, b0 = 2, c0 = 2.5){
  # a repeats with period 6: (a0, a0+b0, b0) followed by its negation;
  # b is the same cycle rotated by one position and negated
  trio = c(a0, a0+b0, b0)
  a = c(trio, -trio)
  b = -c(tail(a, 1L), head(a, -1L))

  cs = cumsum(a/b)          # running sum of a/b within one cycle
  c6 = tail(cs, 1L)         # increment that one full cycle adds to c

  k = (i - 1L) %/% 6L       # complete cycles before iteration i
  ii = 1L + (i - 1L) %% 6L  # position of iteration i within its cycle

  # c0 = 2.5 so that c0 + cs[1] gives the initial c of 3
  list(a = a[ii], b = b[ii], c = c0 + k*c6 + cs[ii])
}

library(data.table)
DT = data.table(iter = 1:10)[, c("a", "b", "c") := f(iter)][]

    iter  a  b          c
 1:    1  1  2  3.0000000
 2:    2  3 -1  0.0000000
 3:    3  2 -3 -0.6666667
 4:    4 -1 -2 -0.1666667
 5:    5 -3  1 -3.1666667
 6:    6 -2  3 -3.8333333
 7:    7  1  2 -3.3333333
 8:    8  3 -1 -6.3333333
 9:    9  2 -3 -7.0000000
10:   10 -1 -2 -6.5000000

That is, you can just skip ahead to any iteration:

> setDT(f(10))[]
    a  b    c
1: -1 -2 -6.5
> setDT(f(100))[]
    a  b      c
1: -1 -2 -101.5
Frank
  • I'd be interested to know how you recognized that the results cycle. Did you just notice it in the results or can you tell from the formulas? – IceCreamToucan May 28 '19 at 15:54
  • @Frank that's brilliant! Unfortunately this recursive pattern appears only in the example; at least I wasn't able to detect any distinct rule for my real data. What I need is a solution that also works without clear patterns... My fault for picking a peculiar example! – Jinglestar May 28 '19 at 16:27
  • @IceCreamToucan I reached the conclusion iteratively :) I noticed that b could be simplified to `DT[i-1, -a]`, but figured that alone wasn't very helpful to leave as a comment, so I kept simplifying. Usually I'd stop at a cumulative sum (as in an earlier edit), but I thought about it more and realized it could be fully closed form because it's a cumsum of the same pattern over and over. – Frank May 28 '19 at 19:39
  • @Jinglestar Ah, in that case this isn't any use, I suspect, though usually it seems that these iterations can be simplified to a cumulative sum or similar, so you might want to look at `?cumsum` or `?cumprod` for your real pattern (a tiny illustration follows below). And maybe also `?replace`, e.g. in https://stackoverflow.com/questions/50032823/r-rowdies-iteration-on-a-data-table-again/50047119#50047119 – Frank May 28 '19 at 19:42
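
(A tiny illustration of the cumulative-sum point in the comment above; the vector y below is made up for the illustration and is not taken from the question.)

y <- c(0.5, -3, -2/3)

# iterative formulation: each running total depends on the previous one
Reduce(`+`, y, accumulate = TRUE)

# vectorised closed form, same result: 0.5, -2.5, -3.1666...
cumsum(y)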

You can use Reduce with accumulate = T:

fun <- function(x, junk){
  # x holds c(a, b, c) from the previous row; junk absorbs the element of
  # Reduce's x argument, which is not used here (see the explanation below)
  x[1] <- sum(x[1:2])       # new a = old a + old b
  x[2] <- diff(x[1:2])      # new b = old b - new a
  x[3] <- x[1]/x[2] + x[3]  # new c = new a / new b + old c
  x
}

dt <- 
  as.data.table(do.call(rbind, Reduce(fun, numeric(9L), accumulate = T, init = 1:3)))

setnames(dt, c('a', 'b', 'c'))

dt
#      a  b          c
#  1:  1  2  3.0000000
#  2:  3 -1  0.0000000
#  3:  2 -3 -0.6666667
#  4: -1 -2 -0.1666667
#  5: -3  1 -3.1666667
#  6: -2  3 -3.8333333
#  7:  1  2 -3.3333333
#  8:  3 -1 -6.3333333
#  9:  2 -3 -7.0000000
# 10: -1 -2 -6.5000000

You can use transpose instead of do.call(rbind, ...), as below, but if you have tidyverse or purrr loaded, make sure transpose is data.table::transpose:

dt <- 
  as.data.table(transpose(Reduce(fun, numeric(9L), accumulate = T, init = 1:3)))

Explanation for junk:

At each iteration, Reduce passes the previous output (or init), together with the i-th element of its x argument, to the function you supply. So even if you are not going to use Reduce's x argument inside fun, fun still needs a parameter to receive it. Without the extra "junk" argument you get an "unused argument" error, because Reduce tries to pass that second argument to a function that only accepts one.
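
(A minimal sketch of that error, using hypothetical toy functions f1 and f2 rather than the fun above.)

# f1 takes only one argument, so the second argument Reduce passes has nowhere to go
f1 <- function(x) x + 1
# Reduce(f1, 1:3, accumulate = T)   # fails with an "unused argument" error

# f2 accepts (and ignores) the extra argument, so it runs fine
f2 <- function(x, junk) x + 1
Reduce(f2, 1:3, accumulate = T)
# [1] 1 2 3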

IceCreamToucan
  • very interesting use of Reduce, which I never thought of! Is there a specific reason why you added 'junk' as an argument? – Jinglestar May 28 '19 at 16:29

Another option:

cols <- c('a','b','c')
A <- 1; B <- 2; C <- 3
DT[iter==1, (cols) := .(A, B, C)]   # seed the first row
DT[iter>1,
    (cols) := {
        # by=iter runs the groups in order, one row at a time, so A, B and C
        # still hold the values computed for the previous row
        A = A + B
        B = B - A
        C = A / B + C
        .(A, B, C)
    },
    by=iter]
chinsoon12
  • This is magic! Can you explain how DT knows to use the A, B and C from the previous row rather than the ones in the parent environment? – Jinglestar May 29 '19 at 11:13
  • Sorry, just saw this. My hypothesis is that `by` loops through each value as within a `for` loop and is thus able to access and update variables in the parent frame. I can only experiment and show it (see the sketch below); I don't know the exact code in `data.table` that would prove it conclusively. – chinsoon12 May 30 '19 at 10:21
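
(A rough way to probe that hypothesis, sketched below; DT2 and x are made up for the illustration. If assignments made in j really persist from one by group to the next, the result column counts up instead of staying at 1.)

library(data.table)
DT2 <- data.table(iter = 1:3)
x <- 0
DT2[, .(x_after = {x = x + 1; x}), by = iter]
# expected, if the state carries over between groups:
#    iter x_after
# 1:    1       1
# 2:    2       2
# 3:    3       3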

In fact, you can solve your problem with a recursive function call, where you propagate the values from call to call and never need to read them back from the previous row. In base R you could do it like this:

DT = data.frame(a = numeric(10L), b = numeric(10L), c = numeric(10L), iter = 1:10)

fun <- function(a, b, c, n) {
  # compute the next row from the current values, then recurse for the remaining rows
  a <- a + b
  b <- b - a
  c <- a/b + c
  n <- n - 1
  if(n <= 0) {return(c(a, b, c))}
  return(rbind(c(a, b, c), fun(a, b, c, n)))
}

DT[1,1:3] <- 1:3
DT[-1,1:3] <- fun(DT[1,1], DT[1,2], DT[1,3], 9)
DT

    a  b          c iter
1   1  2  3.0000000    1
2   3 -1  0.0000000    2
3   2 -3 -0.6666667    3
4  -1 -2 -0.1666667    4
5  -3  1 -3.1666667    5
6  -2  3 -3.8333333    6
7   1  2 -3.3333333    7
8   3 -1 -6.3333333    8
9   2 -3 -7.0000000    9
10 -1 -2 -6.5000000   10

Alternatively, you can simply use a for loop:

DT = data.frame(a = numeric(10L), b = numeric(10L), c = numeric(10L), iter = 1:10)
a <- 1
b <- 2
c <- 3
for(i in seq_len(nrow(DT))) {
  DT[i,1:3] <- c(a,b,c)
  a <- a + b
  b <- b - a
  c <- a/b + c
}
DT

    a  b          c iter
1   1  2  3.0000000    1
2   3 -1  0.0000000    2
3   2 -3 -0.6666667    3
4  -1 -2 -0.1666667    4
5  -3  1 -3.1666667    5
6  -2  3 -3.8333333    6
7   1  2 -3.3333333    7
8   3 -1 -6.3333333    8
9   2 -3 -7.0000000    9
10 -1 -2 -6.5000000   10

But this will also be slow. A fast solution is given by, e.g., IceCreamToucan.
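
For reference, here is a rough way to compare the two approaches yourself. This is only a sketch: it assumes the microbenchmark package is installed, uses a larger n than the toy example so the differences show up, and the exact timings will vary from machine to machine.

library(data.table)
library(microbenchmark)   # assumed to be installed

n <- 2000L

# IceCreamToucan's Reduce/accumulate approach
fun <- function(x, junk){
  x[1] <- sum(x[1:2])
  x[2] <- diff(x[1:2])
  x[3] <- x[1]/x[2] + x[3]
  x
}
reduce_version <- function() {
  as.data.table(do.call(rbind,
    Reduce(fun, numeric(n - 1L), accumulate = T, init = c(1, 2, 3))))
}

# the plain for loop over a data.frame from this answer
loop_version <- function() {
  DF <- data.frame(a = numeric(n), b = numeric(n), c = numeric(n), iter = seq_len(n))
  a <- 1; b <- 2; c <- 3
  for(i in seq_len(n)) {
    DF[i, 1:3] <- c(a, b, c)
    a <- a + b
    b <- b - a
    c <- a/b + c
  }
  DF
}

microbenchmark(reduce = reduce_version(), loop = loop_version(), times = 5L)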

GKi
  • Recursion is generally slow without memoization; see https://stackoverflow.com/questions/6807068/why-is-my-recursive-function-so-slow-in-r – Frank May 28 '19 at 14:14