
I am trying to evaluate the out-of-sample forecasting performance of different OLS models. The simplest time-series regression looks like this: Y_t = b0 + b1 * Y_{t-30} + e_t

The fitting period for the model is, let's say, 50 observations; after that I let the model run, using the dynlm package:

dynlm(as.zoo(Y) ~ L(as.zoo(Y), 30), start = "1996-01-01", end = timelist[i])

In my current code I just let the index i run until the end and then save the RMSE of the corresponding model. But this RMSE is not the out-of-sample one-step-ahead forecast error. Since my current code is already pretty slow and does not even do exactly what I want, I wanted to ask whether you have a suggestion for which package I should use to achieve my goal.

To sum it up, I want to do the following:

1) run a recursive regression after a certain fitting period (expanding window, NOT rolling window)

2) create one-step ahead out-of-sample forecasts

3) calculate the root mean squared error of these forecasts vs. actual observations to evaluate the model performance

I tried doing this so far with a huge for-loop and the dynlm package, but the results are not really satisfying. Any input is greatly appreciated, since I have been looking for a solution for quite a while now. I will update my example code as soon as I have made some progress.

# minimal working example
require(xts)
require(zoo)
require(dynlm)
timelist <- seq.Date(from = as.Date("1996-01-01"), to = as.Date("1998-01-01"), by = "days")
set.seed(123)
Y <- xts(rnorm(n = length(timelist)), order.by = timelist)
X <- xts(rnorm(n = length(timelist), mean = 10), order.by = timelist)
# rmse container
rmse.container.full <- data.frame(matrix(NA, ncol = 3, nrow = length(timelist)))
colnames(rmse.container.full) <- c("Date", "i", "rmse.m1")
rmse.container.full$Date <- timelist
# fitting period
for(i in 50:length(timelist)) {
  # m1 
  model1 <- dynlm(as.zoo(Y) ~ L(as.zoo(X), 30), start = "1996-01-01", end = timelist[i])
  rmse.container.full[i, 2] <- i
  rmse.container.full[i, 3] <- summary(model1)$sigma # in-sample residual SE, not the out-of-sample RMSE
  print(i)
}
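
To make the difference concrete: what I want for each i is a genuine out-of-sample forecast of the next observation, conceptually something like this sketch (plain lm on a data frame with a pre-computed lag, since predicting a new observation from a dynlm fit is awkward; i is the loop index from above):

# sketch: fit on observations 1..i, forecast observation i + 1
df <- data.frame(Y = as.numeric(Y),
                 X.lag30 = as.numeric(xts::lag.xts(X, k = 30)))
fit   <- lm(Y ~ X.lag30, data = df[1:i, ])
fcast <- predict(fit, newdata = df[i + 1, ])  # one-step-ahead forecast
err   <- df$Y[i + 1] - fcast                  # forecast error that feeds the RMSE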
  • Why exactly aren't the results "very satisfying"? That's not a specific problem that we can address. If you need help with statistical modeling, you should ask your question at [stats.se], otherwise make it clear what the programming question is here and include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – MrFlick Jun 24 '16 at 16:59
  • Thank you for your suggestion @MrFlick, I will have a look at this now: [link](https://www.otexts.org/fpp) and also check Cross Validated. – tester Jun 24 '16 at 17:11

2 Answers


Well, as the one who asked the question, I would like to share how I solved my issue:

As I only need the one-step-ahead forecasts, I can throw away everything else, and this made the code run much faster (from ~12 minutes to ~10 seconds per model).

I created the full data frame (including lags) myself and used lm instead of dynlm. The following code gave me the desired results (I checked the first few observations manually, and it seems to work). The code is adapted from here: Recursive regression in R
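
For illustration, such a data frame can be built like this (a sketch; the 30-step lag and the column names Y and X follow the model above):

# sketch: dependent variable plus the 30-step lag of X as regressor
df.full <- na.omit(data.frame(
  Y = as.numeric(Y),                        # Y_t
  X = as.numeric(xts::lag.xts(X, k = 30))   # X_{t-30}; NA for the first 30 rows
))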

mod1.predictions <- lapply(seq(1400, nrow(df.full) - 1), function(x) {
  # fit on the expanding window 1..x, then forecast observation x + 1
  mod1 <- lm(Y ~ X, data = df.full[1:x, ])
  pred1 <- predict(mod1, newdata = df.full[x + 1, ])
  return(pred1)
})

For computing the RMSE I used this function

# rmse function
rmse <- function(sim, obs) {
  res <- sqrt( mean( (sim - obs)^2, na.rm = TRUE) )
  res
}
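
For example, combining the forecasts and the function above (a sketch; rows 1401 onward are the ones that were forecast):

preds  <- unlist(mod1.predictions)
actual <- df.full$Y[1401:nrow(df.full)]  # realized values for the forecast rows
rmse(sim = preds, obs = actual)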

Thanks for the hints to CrossValidated, it helped a lot.


Update

You can do the out-of-sample forecasts with the rollRegres package I have written, as follows (it is faster than the previous solution):

# simulate data
set.seed(101)
n <- 1000
X <- rnorm(n)
y <- 10 - X + rnorm(n)
dat <- data.frame(y = y, X)

# define wrapper to get out-of-sample predicted values
library(rollRegres)
wrapper <- function(formula, data, min_window_size){
  out <- roll_regres(
    formula = formula, data = data, width = min_window_size,
    do_downdates = FALSE, do_compute = "1_step_forecasts")$one_step_forecasts

  out[!is.na(out)]
}

# assign function to compare with
func <- function(formula, data, min_window_size){
  sapply(seq(min_window_size, nrow(data) - 1L), function(x) {
    mod1 <- lm(formula, data = data[1:x, ])
    pred1 <- predict(mod1, newdata = data[x+1, ])
    pred1
  })
}

# show that the two give the same result
frm <- y ~ X
r1 <- wrapper(frm, dat, 25L)
r2 <- func   (frm, dat, 25L)
all.equal(r1, r2, check.attributes = FALSE)
#R> [1] TRUE

# show run time
microbenchmark::microbenchmark(
  func = func(frm, dat, 25L), roll_regres = wrapper(frm, dat, 25L),
  times = 5)
#R> Unit: microseconds
#R>        expr         min          lq        mean      median          uq         max neval
#R>        func 1027213.048 1028723.765 1050103.171 1034833.792 1038513.793 1121231.455     5
#R> roll_regres     560.198     569.284    1073.778     610.766     636.445    2992.198     5

Notice the relative computation time: roughly 560 versus 1,028,700 microseconds, i.e. more than a 1000-fold speedup. Then you can compute the RMSE from the predicted values as in your own answer, for instance:
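
A sketch of that last step, reusing the wrapper defined above and the rmse helper from the accepted answer (the forecasts cover observations min_window_size + 1 through n, matching func):

fc  <- wrapper(frm, dat, 25L)        # one-step-ahead forecasts
obs <- dat$y[26:nrow(dat)]           # the realized values they predict
sqrt(mean((fc - obs)^2))             # out-of-sample RMSE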

Old answer

You can decrease the computation time quite a lot by using the Fortran functions from

Miller, A. J. (1992). Algorithm AS 274: Least squares routines to supplement those of Gentleman. Journal of the Royal Statistical Society. Series C (Applied Statistics), 41(2), 458-478.

You can do so by using this code

# simulate data
set.seed(101)
n <- 1000
X <- matrix(rnorm(10 * n), n, 10)
y <- drop(10 + X %*% runif(10)) + rnorm(n)
dat <- data.frame(y = y, X)

# assign wrapper for biglm (loading biglm makes its compiled AS 274 routines available)
library(biglm)
biglm_wrapper <- function(formula, data, min_window_size){
  mf <- model.frame(formula, data)
  X <- model.matrix(terms(mf), mf)
  y <- model.response(mf)

  n <- nrow(X)
  p <- ncol(X)
  storage.mode(X) <- "double"
  storage.mode(y) <- "double"
  w <- 1
  # state of the QR decomposition, updated one observation at a time
  qr <- list(
    d = numeric(p), rbar = numeric(choose(p, 2)),
    thetab = numeric(p), sserr = 0, checked = FALSE, tol = numeric(p))
  nrbar <- length(qr$rbar)
  beta. <- numeric(p)

  out <- numeric(n - min_window_size)
  for(i in 1:(n - 1)){
    row <- X[i, ] # will be overwritten
    # add observation i to the decomposition
    qr[c("d", "rbar", "thetab", "sserr")] <- .Fortran(
      "INCLUD", np = p, nrbar = nrbar, weight = w, xrow = row, yelem = y[i],
      d = qr$d, rbar = qr$rbar, thetab = qr$thetab, sserr = qr$sserr,
      ier = 0L, PACKAGE = "biglm")[
        c("d", "rbar", "thetab", "sserr")]

    if(i >= min_window_size){
      # solve for the coefficients and forecast observation i + 1
      coef. <- .Fortran(
        "REGCF", np = p, nrbar = nrbar, d = qr$d, rbar = qr$rbar,
        thetab = qr$thetab, tol = qr$tol, beta = beta., nreq = p, ier = 0L,
        PACKAGE = "biglm")[["beta"]]
      out[i - min_window_size + 1] <- coef. %*% X[i + 1, ]
    }
  }

  out
}

# assign function to compare with
func <- function(formula, data, min_window_size){
  sapply(seq(min_window_size, nrow(data)-1), function(x) {
    mod1 <- lm(formula, data = data[1:x, ])
    pred1 <- predict(mod1, newdata = data[x+1, ])
    pred1
  })
}

# show that the two give the same result
frm <- y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
r1 <- biglm_wrapper(frm, dat, 25)
r2 <- func(frm, dat, 25)
all.equal(r1, r2, check.attributes = FALSE)
#R> [1] TRUE

# show run time
microbenchmark::microbenchmark(
  r1 = biglm_wrapper(frm, dat, 25), r2 = func(frm, dat, 25),
  times = 5)
#R> Unit: milliseconds
#R>  expr         min         lq       mean     median         uq        max neval cld
#R>    r1    9.976505   10.00653   11.85052   10.53157   13.36377   15.37424     5  a 
#R>    r2 1095.944410 1098.29661 1122.17101 1098.58264 1113.48833 1204.54306     5   b