2

I have a dataset (1.6M rows, 4 columns of interest) organized by a directed country dyad-year. Each non-commutative dyad (year1-stateA-stateB doesn't always equal year1-stateB-stateA) has an output value 'var1'.

Simplified example of data

library(forecast)
library(dplyr)

df=data.frame(year=c(1994,1995,1996,1997,1998,1964,1965,1967,1968,1969,1988,1987,1988,1989),
          stateA=c(1,1,1,1,1,138,138,138,138,138,20,20,20,20),
          stateB=c(2,2,2,2,2,87,87,87,87,87,55,55,55,55),
          var1=c(0.101,0.132,0.136,0.136,0.148,-0.287,-0.112,0.088,0.101,0.121,0.387,NA,0.377,0.388)
)

> df
   year stateA stateB   var1
1  1994      1      2  0.101
2  1995      1      2  0.132
3  1996      1      2  0.136
4  1997      1      2  0.136
5  1998      1      2  0.148
6  1964    138     87 -0.287
7  1965    138     87 -0.112
8  1967    138     87  0.088
9  1968    138     87  0.101
10 1969    138     87  0.121
11 1988     20     55  0.387
12 1987     20     55     NA
13 1988     20     55  0.377
14 1989     20     55  0.388

What I would like to do is to break down each set of country dyads as a time series and create a forecast prediction using holt's model for the following year using the past 5 years of data.

Expected result: I am hoping to add a new variable which contains the forecasted value for the yearX+1 based on the previous years to the row for yearX.

Complications: Not every country dyad exists for every year and for some years there is not data despite the country dyad existing in the dataset.

What I've done so far:

First, forgive me I've only just recently started using time series in R.

First, I used dplr to organize the data by year (so it will be in proper time series order) then grouped by stateA, stateB

 rolldata <- df %>%
  dplyr::arrange(year) %>% 
  dplyr::group_by(stateA, stateB) %>% [...]

What I was doing before was a 5-year rolling average, which did not fit my analysis needs so this is what it looked like:

rolldata <- df %>%
  dplyr::arrange(year) %>% 
  dplyr::group_by(stateA, stateB) %>% 
  dplyr::mutate(
    point_5a = zoo::rollmean(var1, k = 5, fill = NA, align='right'))

The issue here is that I need to create a time series object for each line to pass to holt() to output the forecasted value (fvar).

dat_ts <- ts(df$var1, start = c(STARTYEAR, 1), end = c(ROWYEAR, 1), frequency = 1)
holt_model <- holt(dat_ts, h = 5)
fvar[i] <-holt_model$x[1]

I hope I have covered the issue in a comprehensible way. Your assistance is most appreciated and I am ready to clarify and answer any questions that might help you to help me.

P.S. Efficiency is not necessary, only results.

EDIT: I do not think I was being clear before but my main goal is to produce an forecast object for each line instead of the subset as a whole. In my example data for country 1 and country 2: there would be a forecast for 1994 based on a time series of 1994; there would be a forecast for 1995 based on 1994-1995; a forecast for 1996 based on 1994-1996. Then the same goes for the pair (138, 87), each row having it's own forecast.

jay.sf
  • 60,139
  • 8
  • 53
  • 110

2 Answers2

1

paste state columns together for split/apply using by. holt warns of action taken because of the missings. You could interpolate them, I'm not sure, though what you want to do with them.

rolldata <- df[order(df$year), ]

library(forecast)
res <- by(rolldata, Reduce(paste, rolldata[c("stateA", "stateB")]), function(x) {
  STARTYEAR <- x$year[1]; ROWYEAR <- x$year[nrow(x)]
  x0 <- x$var1
  x_ts <- ts(x0, start = c(STARTYEAR, 1), end = c(ROWYEAR, 1), frequency = 1)
  holt_mod <- holt(x_ts, h=5)
  holt_mod$x[1]
})
# Warning message:
#   In ets(x, "AAN", alpha = alpha, beta = beta, phi = phi, damped = damped,  :
#            Missing values encountered. Using longest contiguous portion of time series

as.list(res)
# $`1 2`
# [1] 0.101
# 
# $`138 87`
# [1] -0.287
# 
# $`20 55`
# [1] 0.387
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • Thanks for your response and I apologize for my delay in responding. As I edited in my post, I was hoping for a forecast for each row based on that year and the ones preceding it for each unique set of country pairs. Therefore, I do not think that paste would be best. Do you have any other solutions? Thanks. – Karimic Magellan Mar 05 '21 at 07:24
1

You can store an object of class forecast in dataframe itself and extract the relevant values from it.

library(dplyr)
library(purrr)
library(forecast)

df %>%
  arrange(year) %>% 
  group_by(stateA, stateB) %>%
  summarise(ts = list(ts(var1, start = c(min(year), 1), 
                               end = c(max(year), 1), frequency = 1)), 
            holt = map(ts, holt, h = 5), 
            all_value = map(holt, ~.x$x),
            first_value = map_dbl(all_value, first)) %>%
    ungroup

#  stateA stateB ts       holt       all_value first_value
#   <dbl>  <dbl> <list>   <list>     <list>          <dbl>
#1      1      2 <ts [5]> <forecast> <ts [5]>        0.101
#2     20     55 <ts [3]> <forecast> <ts [2]>        0.387
#3    138     87 <ts [6]> <forecast> <ts [6]>       -0.287
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • I did not know that about storing the forecast as an object in the dataframe and it certainly has a lot of potential for my further work on this project however... as I added in an edit in the original post, I am not looking for one forecast for each pair of countries but rather one forecast for each year for the country pairs. Each row should have a forecast based on that year and the ones preceding it. Please let me know if that clarifies things. Thank you. – Karimic Magellan Mar 05 '21 at 07:28
  • I don't understand. If you want to do this for every year would adding `year` in `group_by` help? There are lot of years for a pair where you have only 1 observation so this will not work in that case. Moreover, `start` and `end` of `ts` would be the same. – Ronak Shah Mar 05 '21 at 07:39
  • In my example data for country "1" and country "2": there would be a forecast for 1994 based on a time series of 1994; there would be a forecast for 1995 based on 1994-1995; a forecast for 1996 based on 1994-1996; a forecast for 1997 based on 1994-1997... etc. Does that clarify? The example data is very limited but in my actual dataset each pair of countries typically have about 30-60 years. – Karimic Magellan Mar 05 '21 at 15:13