I have a table of historical daily prices for several products, with NA for future dates. A separate column holds the expected price increase for a product on a given future date; each increase is applied to the prior day's price.
I have written a for-loop to calculate the expected prices, but it runs very slowly over the ~500,000 records it loops through.
All historical prices are populated in the table, while all projected prices are NA.
Example of current table (old_table):
date      product  price  incr_amt
==================================
...       ...      ...    ...
10/14/19  prod1    50     1.0
10/15/19  prod1    50     1.0
10/16/19  prod1    NA     1.0
...       ...      ...    ...
04/01/20  prod1    NA     1.05
04/02/20  prod1    NA     1.0
...       ...      ...    ...
...       ...      ...    ...
10/14/19  prod2    35     1.0
10/15/19  prod2    35     1.0
10/16/19  prod2    NA     1.0
...       ...      ...    ...
01/01/20  prod2    NA     1.02
01/02/20  prod2    NA     1.0
...       ...      ...    ...
My current code groups by product; if a row's price is NA, it calculates the price as the lagged price * incr_amt, then recomputes the lagged price for the next iteration, repeating until every row in the table has been processed.
Example result (new_table):
date      product  price  incr_amt
==================================
...       ...      ...    ...
10/14/19  prod1    50     1.0
10/15/19  prod1    50     1.0
10/16/19  prod1    50     1.0
...       ...      ...    ...
04/01/20  prod1    52.5   1.05
04/02/20  prod1    52.5   1.0
...       ...      ...    ...
...       ...      ...    ...
10/14/19  prod2    35     1.0
10/15/19  prod2    35     1.0
10/16/19  prod2    35     1.0
...       ...      ...    ...
01/01/20  prod2    35.7   1.02
01/02/20  prod2    35.7   1.0
...       ...      ...    ...
My current code works, but it takes over an hour to run. Because each iteration depends on the previous one and order matters, I don't know whether there is a workaround that avoids the loop.
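One thing I noticed while writing this out: because each filled price is just the previous price times a factor, the recurrence telescopes into a running product of the increase factors applied to the last known price. A minimal base-R sketch of that identity, using the prod1 numbers from the example above (this is my own check, not part of the original code):

```r
# Last known price for prod1 and the incr_amt values on the projected (NA) rows, in date order
last_known <- 50
incr <- c(1.0, 1.0, 1.05, 1.0)

# Loop version: price[i] = price[i-1] * incr[i]
loop_price <- numeric(length(incr))
prev <- last_known
for (i in seq_along(incr)) {
  prev <- prev * incr[i]
  loop_price[i] <- prev
}

# Vectorized version: the recurrence collapses into a cumulative product
vec_price <- last_known * cumprod(incr)

all.equal(loop_price, vec_price)  # TRUE
```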
Current Code:
library(tidyverse)

old_table <- tribble(
  ~date,        ~product, ~price, ~incr_amt,
  "2019-10-14", "prod1",  50,     1.0,
  "2019-10-15", "prod1",  50,     1.0,
  "2019-10-16", "prod1",  NA,     1.0,
  "2019-10-17", "prod1",  NA,     1.0,
  "2019-10-18", "prod1",  NA,     1.0,
  "2019-10-19", "prod1",  NA,     1.05,
  "2019-10-20", "prod1",  NA,     1.0,
  "2019-10-21", "prod1",  NA,     1.0,
  "2019-10-14", "prod2",  35,     1.0,
  "2019-10-15", "prod2",  35,     1.0,
  "2019-10-16", "prod2",  NA,     1.0,
  "2019-10-17", "prod2",  NA,     1.0,
  "2019-10-18", "prod2",  NA,     1.0,
  "2019-10-19", "prod2",  NA,     1.0,
  "2019-10-20", "prod2",  NA,     1.0,
  "2019-10-21", "prod2",  NA,     1.02,
  "2019-10-22", "prod2",  NA,     1.0
)

# Seed the lagged price within each product
new_table <- old_table %>%
  group_by(product) %>%
  mutate(lag_price = lag(price))

# Fill each NA price from the prior day's price, one row at a time
for (i in 1:nrow(new_table)) {
  if (is.na(new_table$price[[i]])) {
    new_table$price[[i]] <- new_table$lag_price[[i]] * new_table$incr_amt[[i]]
    # Recompute the entire lag column so the next NA row sees the value just filled in
    new_table$lag_price <- lag(new_table$price)
  }
}
Again, this runs correctly but takes over an hour on the ~500,000 records. How can I improve this process? Thanks.
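In case it helps frame answers: one direction I have started exploring, sketched under the assumption that within each product every NA row comes after the last observed price (which holds in my data), is to collapse the recurrence into a per-group cumulative product instead of looping. This is a rough sketch, not tested at the full 500,000-row scale:

```r
library(tidyverse)

# Small subset of old_table from above, just so the sketch is runnable on its own
old_table <- tribble(
  ~date,        ~product, ~price, ~incr_amt,
  "2019-10-15", "prod1",  50,     1.0,
  "2019-10-16", "prod1",  NA,     1.0,
  "2019-10-19", "prod1",  NA,     1.05,
  "2019-10-15", "prod2",  35,     1.0,
  "2019-10-21", "prod2",  NA,     1.02
)

new_table <- old_table %>%
  group_by(product) %>%
  mutate(
    # Rows with a known price contribute no growth to the running product
    growth = if_else(is.na(price), incr_amt, 1),
    # Each NA price = last known price in the group * cumulative growth so far
    price  = if_else(is.na(price),
                     last(price[!is.na(price)]) * cumprod(growth),
                     price)
  ) %>%
  select(-growth) %>%
  ungroup()
```

On the small subset this reproduces the loop's results (prod1 reaches 52.5 on 10/19, prod2 reaches 35.7 on 10/21), but I would welcome corrections if the grouped cumprod approach has pitfalls I'm missing.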