R: Rolling calculation of column values (avoid loop)

Question

I want to incrementally grow a new column, based on values of the previous row & same column. You could do it with a loop, like so:

df <- data.frame(a = 2000:2010,
                 b = 10:20,
                 c = seq(1000, 11000, 1000),
                 x = 1000)
for(i in 2:nrow(df)) df$x[i] <- (df$c[i]) * df$a[i-1] / df$x[i-1] + df$b[i] * df$a[i]
df
      a  b     c        x
1  2000 10  1000  1000.00
2  2001 11  2000 26011.00
3  2002 12  3000 24254.79
4  2003 13  4000 26369.16
5  2004 14  5000 28435.80
6  2005 15  6000 30497.85
7  2006 16  7000 32556.20
8  2007 17  8000 34611.93
9  2008 18  9000 36665.87
10 2009 19 10000 38718.65
11 2010 20 11000 40770.76

(As you see, new values in column x use values of column x of the previous row.)
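To make the recurrence concrete, here is a small sketch that computes rows 2 and 3 by hand from the same inputs. The names `x1`, `x2`, `x3` and the capital `C` (to avoid masking `c()`) are only for illustration.

```r
a <- 2000:2010
b <- 10:20
C <- seq(1000, 11000, 1000)

x1 <- 1000                              # starting value in row 1
x2 <- C[2] * a[1] / x1 + b[2] * a[2]    # 2000 * 2000 / 1000 + 11 * 2001
x2
#> [1] 26011

x3 <- C[3] * a[2] / x2 + b[3] * a[3]    # needs x2, so rows cannot be computed independently
round(x3, 2)
#> [1] 24254.79
```

Because each value divides by the previous one, the recurrence is not a plain cumulative sum or product, which is why `cumsum()` and `cumprod()` do not apply directly.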

However, as I do this for a Shiny app, I need to have fast calculation, thus using loops is not optimal. Is there a way of doing this which avoids loops, ideally making use of dplyr's piping? This reply (Referring to previous row in calculation) suggests a way using sapply - however, I am unable to do this mathematically...

Moving parts of your calculation outside of the loop should speed things up quite a bit, regardless of the looping construct you choose. — , Jan 01 '20 at 10:28

score 5 · Accepted Answer · edited Jun 20 '20 at 09:12

There are a few options.

Use vectors

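Each df$x[i] <- ... inside the loop is expensive: assigning into a data frame column goes through the data frame assignment methods and can copy data on every iteration. Instead, extract the columns as plain vectors once, loop over those, and assign the result back to df$x once at the end.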

#easiest - extract the vectors before the loop
C <- df[['c']] #used big C because c() is a function
a <- df[['a']]
b <- df[['b']]
x <- df[['x']]

for(i in seq_along(x)[-1]) x[i] <- C[i] * a[i-1] / x[i-1] + b[i] * a[i]

df$x <- x #write the result back to the data frame once, after the loop

Use a function

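Turning your loop into a function improves performance further, most likely because R byte-compiles functions (the JIT compiler has been on by default since R 3.4.0) and the loop then only touches local vectors.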

f_recurse = function(a, b, C, x){
  for (i in seq_along(x)[-1]) x[i] <- C[i] * a[i-1] / x[i-1L] + b[i] * a[i]
  x
}

f_recurse(df$a, df$b, df$c, df$x)
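Since the question asks about piping, here is a minimal sketch of how the function could be dropped into a column assignment. It uses base R's `transform()` so it runs without extra packages; the dplyr line in the comment is the equivalent I would expect to work with dplyr attached, and is not benchmarked here.

```r
f_recurse <- function(a, b, C, x) {
  for (i in seq_along(x)[-1]) x[i] <- C[i] * a[i - 1] / x[i - 1] + b[i] * a[i]
  x
}

df <- data.frame(a = 2000:2010,
                 b = 10:20,
                 c = seq(1000, 11000, 1000),
                 x = 1000)

# Column names are looked up in df first, so `c` here is the column, not c()
df <- transform(df, x = f_recurse(a, b, c, x))

# With dplyr attached, the equivalent would be:
# df <- df %>% mutate(x = f_recurse(a, b, c, x))

head(df, 3)
```

The function returns a vector of the same length as its inputs, which is what lets it sit inside a column assignment like this.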

Use Rcpp

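Finally, if the response is still too laggy, you can try Rcpp. Note that Rcpp does not copy an argument whose type already matches the declared one, so x is updated in place: because df$x is already a double vector, the call modifies df$x itself, and returning the vector is not strictly necessary.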

library(Rcpp)
cppFunction('
NumericVector f_recurse_rcpp(IntegerVector a, IntegerVector b, NumericVector C, NumericVector x){
for (int i = 1; i < x.size(); ++i){
 x[i] = C[i] * a[i-1] / x[i - 1] + b[i] * a[i];
}
return(x);
}
')

f_recurse_rcpp(df$a, df$b, df$c, df$x)
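If modifying the input in place is not what you want (in a Shiny app the same data frame may be reused elsewhere), one option is to work on a deep copy. The sketch below is a hypothetical variant, `f_recurse_rcpp_copy`, that uses `clone()` from Rcpp; the cast to `double` is my own addition so that `b[i] * a[i]` is not computed in C++ `int` arithmetic.

```r
library(Rcpp)

cppFunction('
NumericVector f_recurse_rcpp_copy(IntegerVector a, IntegerVector b,
                                  NumericVector C, NumericVector x) {
  NumericVector out = clone(x);  // deep copy, the input vector stays as it was
  for (int i = 1; i < out.size(); ++i) {
    // cast to double so the product is not computed as int * int
    out[i] = C[i] * a[i - 1] / out[i - 1] + static_cast<double>(b[i]) * a[i];
  }
  return out;
}
')

df <- data.frame(a = 2000:2010,
                 b = 10:20,
                 c = seq(1000, 11000, 1000),
                 x = 1000)

res <- f_recurse_rcpp_copy(df$a, df$b, df$c, df$x)
all(df$x == 1000)  # the input column has not been touched
#> [1] TRUE
```

The copy costs one extra allocation per call, so expect it to be slightly slower than the in-place version.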

Performance

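In all, we get close to a 1,000-fold speedup on the original 11-row data frame (median 8.8ms for the original loop versus 10us for Rcpp). The table below is from bench::mark, which also checks that all expressions return equal results.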

# A tibble: 4 x 13
  expression                                 min  median `itr/sec` mem_alloc
  <bch:expr>                             <bch:t> <bch:t>     <dbl> <bch:byt>
1 OP                                      8.27ms   8.8ms      106.   62.04KB
2 extract                                 6.21ms  7.49ms      126.   46.16KB
3 f_recurse(df$a, df$b, df$c, df$x)       13.1us  28.8us    33295.        0B
4 f_recurse_rcpp(df$a, df$b, df$c, df$x)   8.6us    10us    98240.    2.49KB
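The call that produced this table is not shown above. A rough reconstruction might look like the sketch below, assuming the bench package is installed; the labels `OP` and `extract` are chosen to match the table, the Rcpp row is left out to keep it dependency-light, and timings will differ by machine.

```r
library(bench)

df <- data.frame(a = 2000:2010,
                 b = 10:20,
                 c = seq(1000, 11000, 1000),
                 x = 1000)

f_recurse <- function(a, b, C, x) {
  for (i in seq_along(x)[-1]) x[i] <- C[i] * a[i - 1] / x[i - 1] + b[i] * a[i]
  x
}

res <- bench::mark(
  OP = {
    # original approach: assign into the data frame on every iteration
    for (i in 2:nrow(df)) df$x[i] <- df$c[i] * df$a[i - 1] / df$x[i - 1] + df$b[i] * df$a[i]
    df$x
  },
  extract = {
    # extract plain vectors first, then loop over them
    C <- df[["c"]]; a <- df[["a"]]; b <- df[["b"]]; x <- df[["x"]]
    for (i in seq_along(x)[-1]) x[i] <- C[i] * a[i - 1] / x[i - 1] + b[i] * a[i]
    x
  },
  f_recurse(df$a, df$b, df$c, df$x)
)

res[, c("expression", "min", "median", "itr/sec", "mem_alloc")]
```

Each expression ends by returning the computed vector so that bench::mark's default `check = TRUE` can compare the results.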

And here are the same benchmarks with a 1,000-row data.frame and then a 10,000-row one:

df <- data.frame(a = sample(1000L),
                 b = sample(1001:2000),
                 c = seq(1000, 11000, length.out = 1000),
                 x = rep(3, 1000L))

# A tibble: 4 x 13
  expression                                 min   median `itr/sec` mem_alloc
  <bch:expr>                             <bch:t> <bch:tm>     <dbl> <bch:byt>
1 OP                                      23.9ms  24.38ms      39.4    7.73MB
2 extract                                  6.5ms   7.71ms     123.    69.84KB
3 f_recurse(df$a, df$b, df$c, df$x)      265.7us  271.9us    3596.    23.68KB
4 f_recurse_rcpp(df$a, df$b, df$c, df$x)  17.4us   18.9us   51845.     2.49KB

df <- data.frame(a = sample(10000L),
                 b = sample(10001:20000),
                 c = seq(1000, 11000, length.out = 10000),
                 x = rep(3, 10000L))

# A tibble: 4 x 13
  expression                                  min   median `itr/sec` mem_alloc
  <bch:expr>                             <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 OP                                     353.17ms 412.62ms      2.42  763.38MB
2 extract                                  8.75ms   8.95ms    107.    280.77KB
3 f_recurse(df$a, df$b, df$c, df$x)        2.58ms   2.61ms    376.    234.62KB
4 f_recurse_rcpp(df$a, df$b, df$c, df$x)   98.6us  112.7us   8169.      2.49KB

Great answer! I racked my brain but could not solve this question. Could you tell me how you learned that `loop into a function will improve performance`, and about the `seq_along` method? From any book or other resources? — Travis, Jan 01 '20 at 12:45
The `seq_along` isn't anything special, but I have seen enough posts here to know users sometimes prefer it to `2:length(x)` for cases where the length of x is 0: `seq_along(x)[-1]` wouldn't produce an error. As for the second, see this classic post: https://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r — Cole, Jan 01 '20 at 12:48
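A quick sketch of the difference described in that comment, using a zero-length and a length-one vector (`x0` and `x1` are illustrative names):

```r
x0 <- numeric(0)
2:length(x0)          # counts down, so a loop would run with i = 2, 1, 0
#> [1] 2 1 0
seq_along(x0)[-1]     # empty, so a loop body never runs
#> integer(0)

x1 <- 5
2:length(x1)          # still counts down: i = 2, 1
#> [1] 2 1
seq_along(x1)[-1]
#> integer(0)
```

So `seq_along(x)[-1]` gives the intended "from the second element on" index set even when there is no second element.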
