
I am trying to calculate one-step-ahead forecasts using the so-called MIDAS approach. In this approach, forecasts of a low-frequency variable are computed from higher-frequency data. For example, the dependent variable y could be recorded yearly and explained by an independent variable x that is sampled quarterly.

The midasr package offers many functions for this. I can calculate the one-step-ahead forecasts using its function select_and_forecast as follows (with simulated data, which is a simplified version of the example from the midasr user's guide):

Generation of the data:

library(midasr)
set.seed(1001)
n <- 250
trend <- c(1:n)
x <- rnorm(4 * n)
z <- rnorm(12 * n)
fn.x <- nealmon(p = c(1, -0.5), d = 8)
y <- 2 + 0.1 * trend + mls(x, 0:7, 4) %*% fn.x + rnorm(n)

Calculation of forecasts (the out-of-sample forecast horizon is controlled by the argument outsample, so in my example I am calculating forecasts for observations 240 to 250):

select_and_forecast(y~trend+mls(y,1,1,"*")+mls(x,0,4),
                          from=list(x=c(4)),
                          to=list(x=rbind(c(14,19))),
                          insample=1:250,outsample=240:250,
                          weights=list(x=c("nealmon","almonp")),
                          wstart=list(nealmon=rep(1,3),almonp=rep(1,3)),
                          IC="AIC",
                          seltype="restricted",
                          ftype="recursive",
                          measures=c("MSE"),
                          fweights=c("EW","BICW")
)$forecasts[[1]]$forecast

What I would like to do now is simulate a situation where a new value of the higher-frequency variable becomes available, for example because a new month has passed and the value for that month can be used in the model. I would proceed as follows, but I am very unsure whether it is correct:

select_and_forecast(y~trend+mls(y,1,1,"*")+mls(x,0,4),
                          from=list(x=c(3)),   # The only change: the lower bound of the lag range of the regressor is reduced from 4 to 3
                          to=list(x=rbind(c(14,19))),
                          insample=1:250,outsample=240:250,
                          weights=list(x=c("nealmon","almonp")),
                          wstart=list(nealmon=rep(1,3),almonp=rep(1,3)),
                          IC="AIC",
                          seltype="restricted",
                          ftype="recursive",
                          measures=c("MSE"),
                          fweights=c("EW","BICW")
)$forecasts[[1]]$forecast

In theory, new observations of the higher-frequency variable are included by reducing the time index, but I don't know whether using the function this way is correct.

This question is for someone who is familiar with the package. Can someone comment on this?

The formula I have in mind is:

y_t=\beta_0 + \beta_1B(L^{1/m};\theta)x_{t-h+1/m}^{(m)} + \epsilon_t^{(m)}

with h=1 in my case, and 1/m added to include a new high-frequency observation.

DatamineR

1 Answer

I am not sure that I understood your question correctly, so I will give an example which I hope will answer it.

Suppose your response variable y is observed at a yearly frequency and the predictor variable x is observed quarterly (which corresponds to the simulated data). Say you are interested in forecasting next year's y value using the data from the previous year. Then the model equation in the package midasr is the following:

y~mls(x,4:7,4)

The values 4:7 are the lags of x used for prediction and 4 indicates that there are 4 observations of x for every observation of y.

The package midasr uses the convention that for low-frequency period t=l we observe the high-frequency periods m*(l-1)+1, ..., m*l. So for year 1 we have the quarters 1,2,3,4, and for year 2 we have the quarters 5,6,7,8. This convention then assumes that we observe y at year 1 together with quarter 4 of x, y at year 2 together with quarter 8 of x, and so on.

The MIDAS model is formulated in terms of lags, which start at zero. So if we want to explain y at year 1 (in our example the low frequency is the yearly frequency) with the values of x from the same year, i.e. quarters 4,3,2,1, we use the lags 0,1,2,3. If our goal is to explain y at year 2 with the values of x at year 1, then we use lags 4,5,6,7, which correspond to quarters 4,3,2,1.

Now assume that we are in year 3 and have not yet observed the y value, but we have already observed the first quarter of year 3, i.e. quarter 9. Suppose we want to use this information for forecasting. Year 3 is aligned with quarter 12, so quarter 9 is three high-frequency lags behind year 3; hence the model specification is now
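The lag arithmetic above can be checked with a few lines of base R. This is only a sketch of the indexing convention as described here; `hf_index` is a hypothetical helper introduced for illustration and is not a midasr function.

```r
# Hypothetical helper (not part of midasr): index of the high-frequency
# observation that lies `lag` periods behind low-frequency period `l`,
# assuming period l is aligned with high-frequency observation m * l.
hf_index <- function(l, lag, m) m * l - lag

hf_index(l = 1, lag = 0:3, m = 4)  # quarters 4 3 2 1 (same year)
hf_index(l = 2, lag = 4:7, m = 4)  # quarters 4 3 2 1 (previous year)
```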

y~mls(x,3:7,4)

where we also include all the information about the previous year too.
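To see which quarters each specification picks up, here is a minimal base-R sketch of the alignment. The function `hf_lag_matrix` is a hypothetical stand-in written for this illustration, not the midasr implementation of `mls`, and the toy `x` simply holds the quarter numbers 1 to 12 so that the selected quarters are visible directly.

```r
# Hypothetical base-R stand-in for the alignment described above; it is not
# the midasr implementation of mls().
hf_lag_matrix <- function(x, lags, m) {
  n_low <- length(x) %/% m                     # complete low-frequency periods
  idx <- outer(m * seq_len(n_low), lags, "-")  # row l, column k -> index m*l - k
  idx[idx < 1] <- NA                           # lags reaching before the sample start
  out <- matrix(x[as.vector(idx)], nrow = n_low)
  colnames(out) <- paste0("lag", lags)
  out
}

x <- 1:12                      # quarter numbers stand in for data, years 1 to 3
hf_lag_matrix(x, 4:7, 4)[3, ]  # year 3 explained by year 2 only: 8 7 6 5
hf_lag_matrix(x, 3:7, 4)[3, ]  # quarter 9 added: 9 8 7 6 5
```

Lowering the smallest lag from 4 to 3 adds exactly one column, the first quarter of the year being forecast, and leaves the previous year's quarters in place.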

So if my example corresponds to what you are asking, then yes, including the new high-frequency observation is only a matter of changing the value of the from argument the way you did. However, I strongly suggest starting with one simple model to fully grasp the way the package works.

mpiktas
  • The second execution of `select_and_forecast` (with a minimum lag of 3) suggested the model `y ~ trend + mls(y, 1, 1, "*") + mls(x, 3:16, 4, nealmon)`. I then estimated it: `m<-midas_r(y ~ trend + mls(y, 1, 1, "*") + mls(x, 3:16, 4, nealmon),start=list(x=rep(1,3)))` `m MIDAS regression model model: y ~ trend + mls(y, 1, 1, "*") + mls(x, c(3:20), 4, nealmon) (Intercept) trend y x1 x2 x3 1.83424 0.09362 0.06864 0.12670 1.00000 1.00000` [notice that the coefficients are slightly different from those of the suggested model, I suppose because of optim]. – DatamineR Jan 15 '14 at 13:36
  • Then I tried to forecast `forecast(m,newdata=list(x=rep(NA,3),trend=251))` with the error message: `Error in mls(x, c(3:20), 4, nealmon) : Incomplete high frequency data` – DatamineR Jan 15 '14 at 13:37
  • First, try different starting values. A lot depends on them, so you should try at least several. The forecast function produced an error because you did not supply the first-quarter value. If you are forecasting next year with the first quarter present, you need to supply it. You supplied only 3 values, whereas forecast expected 4. Note that the argument `newdata` means new data, i.e. data which was not available during model estimation. – mpiktas Jan 15 '14 at 13:44
  • I suppose the differences result from the `fnscale` parameter, because in the first case it is: $opt$counts:function:90 gradient 23 and in the second case: $opt$counts: function: 115 gradient 22 – DatamineR Jan 15 '14 at 13:51
  • Where did you find the `fnscale` parameter? I strongly suggest reading about optimisation algorithms and non-linear least squares problems. The MIDAS regression (with restrictions) is a non-linear model, hence it depends on the starting values; bad starting values can lead to a misleading fit. – mpiktas Jan 15 '14 at 14:14
  • I just added `$opt$counts:` to the model suggested by `select_and_forecast`, then estimated the same model myself with `midas_r` using the same starting values (`rep(1,3)`) and again looked at `$opt$counts:`. The estimates differ slightly, which is strange, because the starting values are the same... – DatamineR Jan 15 '14 at 14:19
  • What do you mean when you say you've added `$opt$counts`? To what did you add it? Note that `select_and_forecast` does model lag selection using information criteria. This means that the sample size is adjusted so that all the candidate models are estimated using the same sample, so the selected model is possibly estimated with a smaller sample. When you estimate it manually, the full sample is used. If the model is specified correctly and the starting values are appropriate, the difference between the results should be negligible. Otherwise it is an indication of a problem. – mpiktas Jan 15 '14 at 14:30
  • I chose the first suggested model and accessed it with `saf$bestlist[[1]][[1]]`, where `saf` is the output of my second application of the function `select_and_forecast`. I thought the data used by the function is controlled by `insample`, so I chose `insample=1:250` to make sure the complete data set is used, which I also used in `midas_r` – DatamineR Jan 15 '14 at 14:40
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/45291/discussion-between-rstudent-and-mpiktas) – DatamineR Jan 15 '14 at 15:12