
I'm desperately trying to avoid for loops to calculate custom financial indicators (multiple stocks, 5,000 rows per stock). I'm trying to use purrr::map2, and it is fine when doing math on existing vectors, but I need to reference the lag (previous) value of the vector I'm trying to create. Without referencing a previous value, purrr::map2 works fine:

some_function <- function(a, b) { (a * b) + ((1 - a) * b) }
a <- c(0.019, 0.026, 0.012, 0.022)  # some indicator
b <- c(15.5, 16.7, 14.8, 13.1)  # close price
purrr::map2(a, b, some_function)

which just results in the original close values (since a*b + (1-a)*b simplifies to b):

15.5, 16.7, 14.8, 13.1

But what I'm really trying to do is create a new vector (c) that looks back on itself (lag) as part of the calculation. If it is the first row, c == b; otherwise:

desired_function <- function(a, b, c) { (a * b) + ((1 - a) * lag(c)) }

So I create a vector c, populate its first value, and try:

c <- c(15.5, 0, 0, 0)
purrr::map2(a, b, c, desired_function)

And get all NULL values, obviously.
Values for c should be: 15.50, 15.53, 15.52, 15.47
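(For example, the second value comes from 0.026 * 16.7 + (1 - 0.026) * 15.5 = 15.5312.)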

Referencing a previous value is a common thing among indicators, and it forces me to go to clunky, slow 'for loops'. Any suggestions are greatly appreciated.

Dan Hill
    try `purrr::pmap(list(a, b, c), desired_function)` – guasi Jul 03 '22 at 00:20
  • That syntax works, but my function isn't set up right to "look back" on the previous value of the created variable. Thanks for the proper syntax. – Dan Hill Jul 03 '22 at 15:23
  • I've added a small experiment to my answer which implies purrr might not be so fast after all. – Caspar V. Jul 03 '22 at 21:59

2 Answers


If calculating a value in a vector requires a previously calculated value from the same vector, the calculation simply can't be vectorized; you'll have to compute the values one after another.

For loops aren't slow by themselves; it's how you use them. For instance, retrieving values from a data frame one value at a time, or inserting them one value at a time, is a common practice that is very slow.
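
For example, a rough sketch of the difference (the data frame and column names here are made up for illustration; the slow version indexes the data frame inside the loop, the fast one pulls the columns out as plain vectors first):

# hypothetical data frame standing in for one stock's indicator (a) and close price (b)
df <- data.frame(a = runif(5000, 0.01, 0.03), b = runif(5000, 13, 17))

# slow: touches the data frame element by element on every iteration
slow_loop <- function(df) {
  out <- numeric(nrow(df))
  out[1] <- df[1, "b"]
  for (i in 2:nrow(df)) {
    out[i] <- df[i, "a"] * df[i, "b"] + (1 - df[i, "a"]) * out[i - 1]
  }
  out
}

# faster: extract the columns as plain vectors once, then loop over those
fast_loop <- function(df) {
  a <- df$a
  b <- df$b
  out <- numeric(length(a))
  out[1] <- b[1]
  for (i in 2:length(a)) {
    out[i] <- a[i] * b[i] + (1 - a[i]) * out[i - 1]
  }
  out
}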

The implementation of for loops in R has improved a lot over the past 10 years; allegedly they used to be much less efficient, and in older posts you'll find many people complaining about them.

Recommended reading:

https://www.r-bloggers.com/2018/06/why-loops-are-slow-in-r/

And these two old questions (well, their answers):

Speed up the loop operation in R

Why are loops slow in R?

A little experiment

Let's benchmark the simplest (dumbest?) for-loop implementation against purrr::map2() for a function without lag: c = a*b + (1-a)*b

On this benchmark with 10 million items, the for-loop was over 15 times faster than purrr::map2().

# functions ---------------------------------------------------------------

# lag-free version of the calculation, for purrr::map2()
desired_function <- function(a,b) { a*b + (1-a) * b }

# the same calculation as a plain for loop, with the result vector pre-allocated
des_fnc_for <- function(a, b) {
  c <- numeric(length(a))
  c[1] <- b[1]
  for(i in seq_along(a)) c[i] <- a[i] * b[i] + (1 - a[i]) * b[i]
  return(c)
}


# verify --------------------------------------------------------------------

a <- c(0.019, 0.026, 0.012, 0.022)  # some indicator
b <- c(15.5, 16.7, 14.8, 13.1)  # close price

unlist(purrr::map2(a,b,desired_function))

[1] 15.5 16.7 14.8 13.1

des_fnc_for(a,b)

[1] 15.5 16.7 14.8 13.1


# benchmark ---------------------------------------------------------------

a <- runif(10000000, 0.01, 0.03)
b <- runif(10000000, 13, 17)

system.time( des_fnc_for(a,b) )

   user  system elapsed 
  1.143   0.007   1.163 

system.time( purrr::map2(a,b,desired_function) )

   user  system elapsed 
 18.570   0.627  19.761 
Caspar V.
  • wow, that's quite a difference, Caspar. I will re-examine my for loops. I must have something wonky slowing it down. – Dan Hill Jul 03 '22 at 22:40

Here are some solutions. The first one follows your idea using stats::lag (written as stats::, because the dplyr package masks lag when loaded):

r <- numeric(4L)
for (i in 1:4) {
  # also store the new value in c[i + 1], so the next iteration finds it as its lagged value
  r[i] <- c[i + 1] <- a[i]*b[i] + (1 - a[i])*stats::lag(c)[i]
}
r
# [1] 15.50000 15.53120 15.52243 15.46913

And here is another one, using a starting value that is updated in every iteration; it is about 20% faster.

r <- numeric(4L)
sval <- 15.5
for (i in 1:4) {
  r[i] <- sval <- a[i]*b[i] + (1 - a[i])*sval
}
r
# [1] 15.50000 15.53120 15.52243 15.46913

Data:

a <- c(0.019, 0.026, 0.012, 0.022)
b <- c(15.5, 16.7, 14.8, 13.1)
c <- c(15.5, 0, 0, 0)
jay.sf
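
For completeness, the same running-value recurrence can also be written without an explicit for loop using purrr::accumulate2(), which carries the accumulated value forward for you (it still evaluates sequentially under the hood, so don't expect it to beat the loop). A minimal sketch, assuming purrr is installed and using the data above:

library(purrr)

# c[1] = b[1]; c[i] = a[i]*b[i] + (1 - a[i])*c[i - 1]
res <- accumulate2(
  .x = a[-1], .y = b[-1],                                      # rows 2..n
  .f = function(prev, a_i, b_i) a_i * b_i + (1 - a_i) * prev,
  .init = b[1]                                                 # first row: c == b
)
unlist(res)   # unlist in case the result comes back as a list
# [1] 15.50000 15.53120 15.52243 15.46913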