Make for-loop within a function faster in R

Question

Sample data

set.seed(123)
df <- data.frame(day = 1:365, Precp = sample(1:30, 365, replace = T), 
  ETo = sample(1:10,  365, replace = T), top.FC = 23, CN = 61, DC = 0.4)

This data has day of the year, rainfall and evapotranspiration & some constants like top.FC, CN and DC. For a given day i, the water.update function calculates the Soil water for day i

water.update <- function(WAT0, RAIN.i, ETo.i, CN, DC, top.FC){ 

S = 25400/CN - 254;  IA = 0.2*S

if (RAIN.i > IA) { RO = (RAIN.i - 0.2 * S)^2/(RAIN.i + 0.8 * S)
} else { 
RO = 0 
}

if (WAT0 + RAIN.i - RO > top.FC) { 
DR = DC * (WAT0 + RAIN.i - RO - top.FC) 
} else { 
DR = 0 
}    
dWAT = RAIN.i - RO - DR - ETo.i
WAT1 = WAT0 + dWAT
WAT1 <- ifelse(WAT1 < 0, 0, WAT1) 
return(list(WAT1,RO,DR))
}

The function water.model applies the water.update for all days. It's recursive i.e. each days soil water needs previous day's soil water. Hence the loop in the water.model function.

water.model <- function(dat){

 top.FC  <- unique(dat$top.FC)    

 # I make a vector to store the results 
 dat$WAT <- -9999.9
 dat$RO <- -9999.9
 dat$DR <- -9999.9

# First day (day 1) has a default value
dat$WAT[1] <- top.FC/2 # assuming top soil water is half the content on day 1   
dat$RO[1] <- NA 
dat$DR[1] <- NA

# Now calculate water content for day 2 onwards  

for(d in 1:(nrow(dat)-1)){

 dat[d + 1,7:9] <- water.update(WAT0 = dat$WAT[d], 
                                 RAIN.i = dat$Precp[d + 1], 
                                 ETo.i = dat$ETo[d + 1], 
                                 CN = unique(dat$CN), 
                                 DC = unique(dat$DC),
                                 top.FC = unique(dat$top.FC))
 }
 return(dat)
}


 ptm <- proc.time()
 result <- water.model(df)
 proc.time() - ptm

    user  system elapsed 
    0.18    0.00    0.17

The for loop is inevitable in this case since it uses previous day's water content to determine the water content of the present day.

Is there faster way to write the above function? I am looking for speeding up this code. The reason is because my actual data is much bigger.

I answered a [very similar question in the past here](https://stackoverflow.com/questions/49118364/r-dplyr-solution-for-for-loop-with-initial-conditions-set/49141354#49141354) using `Rcpp`, I would recommend considering a similar strategy here. — Matt Summersgill, May 21 '18 at 18:21
Are you sure the for loop is inevitable? You can't be clever with `cumsum` and `cumprod`? (Hard to tell without seeing an equation.) A couple obvious slow parts: calculate your `unique(dat$*)` inputs once, before the loop, not repeatedly for every iteration inside the loop (won't help much, but on principle do everything you can outside of the loop). — Gregor Thomas, May 21 '18 at 18:23
@MattSummersgill that's why I added the tag Rcpp since it is quite fast and I am looking to implement similar stuff based on what you provided. However, my knowledge of C++ is quite bad and hence the question — 89_Simple, May 21 '18 at 18:32
@Gregor the reason why I said for-loop is inevitatble is because each day's water content is calculated by adding previous day's water content + rainfall - ETo and some other stuff. — 89_Simple, May 21 '18 at 18:34
Yes, I understand there is a dependency on the previous day, but I can't see exactly what it is. Lacking comments, a formula, and clear variable names, it is hard to follow your code to see if a loop is really needed. From your description in the above comment, if the rainfall and ETo are all known in advance, then something like `water_content = cumsum(rainfall - ETo)` would get you there without a loop. — Gregor Thomas, May 21 '18 at 18:39
@Crop89, I have 2 minor suggestions that do improve performance, but not as much as an `Rcpp` solution would. First, in your `water.update` function, vectorize your `if` statements, i.e. something like `RO = (RAIN.i > IA)*(RAIN.i - 0.2 * S)^2/(RAIN.i + 0.8 * S)`, and `DR = (WAT0 + RAIN.i - RO > top.FC)*(DC * (WAT0 + RAIN.i - RO - top.FC))`. Second, in the `water.model` function, create `CN`, `DC`, and `top.FC` before the loop, in which case you avoid 1089 extra calculations (3*(nrow(dat)-1) - 3) — Yannis Vassiliadis, May 21 '18 at 18:42
Also, since you're using `d` once and `d+1` 3 times inside the loop, you will save some time by switching to `d-1` and `d` respectively. In this case, the loop will be `for(d in 2:nrow(dat))`. So, in this case you replace 3 additions with 1 subtraction, again saving some time. — Yannis Vassiliadis, May 21 '18 at 18:48
@DirkEddelbuettel I'm adding the `Rcpp` tag back now that I've provided an `Rcpp` based answer. — Matt Summersgill, May 21 '18 at 21:59
Sure, very kind of you to write code for the OP. Should that have been closed as a dupe though? — Dirk Eddelbuettel, May 21 '18 at 22:19
I don't think it really meets a _strict_ definition of duplicate -- It's at least structured differently enough that I didn't realize that I was linking to an answer to another one of the OP's question's initially. — Matt Summersgill, May 21 '18 at 22:34

score 1 · Accepted Answer · answered May 21 '18 at 21:55

Using Rcpp and data.table. The code below runs, but I'm getting slightly different results than the R code you provided. I suspect it has to do with the way I interpreted which indices you were using to lag/lead various columns, but without the domain knowledge of what these things represent I'm having a hard time intuitively understanding what the correct logic should be. Hopefully this is a decent starting point!

Create a separate file named WaterModel.cpp with the following contents:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]

List WaterModel(NumericVector RAIN,
                NumericVector ETo,
                double CN,
                double DC,
                double topFC) {

  double S = 25400/CN - 254;
  double IA = 0.2*S;

  int n = RAIN.length();
  NumericVector WAT(n);
  NumericVector RO(n);
  NumericVector DR(n);

  WAT[0] = topFC/2;

  for (int i = 1; i < n; i++) {

    if (RAIN[i] > IA) {
      RO[i] = pow((RAIN[i-1] - 0.2 * S),2) / (RAIN[i-1] + 0.8 * S);
    } else { 
      RO[i] = 0;
    }

    if (WAT[i-1] + RAIN[i-1] - RO[i-1] > topFC) { 
      DR[i] = DC * (WAT[i-1] + RAIN[i-1] - RO[i-1] - topFC) ;
    } else { 
      DR[i] = 0 ;
    } 

    WAT[i] = WAT[i-1] + RAIN[i-1] - RO[i-1] - DR[i-1] - ETo[i-1];

    if (WAT[i] < 0){
      WAT[i] = 0;
    }

  }
    return Rcpp::List::create(Rcpp::Named("WAT") = WAT,
                              Rcpp::Named("RO") = RO,
                              Rcpp::Named("DR") = DR);

}

Then use a Rcpp::sourceCpp() to source it. You can then keep your constants outside of the data.table and store them as single values instead of repeating them for every single row. This saves us from needing to allocate full vectors in the C++ function when all we really need is a single double value, and should save some time/memory.

library(data.table)
library(Rcpp)

set.seed(123)
DT <- data.table(day = 1:365,
                 Precp = sample(1:30, 365, replace = T), 
                 ETo = sample(1:10,  365, replace = T))

## Don't make constant columns just to store constants
Const_topFC = 23
Const_CN = 61
Const_DC = 0.4

Rcpp::sourceCpp("WaterModel.cpp")

DT[,c("WAT","RO","DR"):= WaterModel(Precp,ETo,Const_CN,Const_DC,Const_topFC)]

DT
#       day Precp ETo     WAT  RO        DR
#   1:   1     9   8 11.50000  0  0.0000000
#   2:   2    24   2 12.50000  0  0.0000000
#   3:   3    13   1 34.50000  0  5.4000000
#   4:   4    27   5 41.10000  0  9.8000000
#   5:   5    29   5 53.30000  0 18.0400000
# ---                                     
# 361: 361     5   8 30.10327  0  8.6166592
# 362: 362     6   9 18.48661  0  4.8413088
# 363: 363    27   7 10.64530  0  0.5946452
# 364: 364    10   8 30.05066  0  5.8581216
# 365: 365    11   1 26.19254  0  6.8202636

Thank you Matt for another example. Now I understand the gist of writing Rcpp. — 89_Simple, May 22 '18 at 08:48

Make for-loop within a function faster in R

1 Answers1