I have a dataset (1.6M rows, 4 columns of interest) organized by directed country dyad-year. The dyads are non-commutative: the value for year1-stateA-stateB does not always equal the value for year1-stateB-stateA. Each dyad-year has an output value 'var1'.
Simplified example of data
library(forecast)
library(dplyr)
df=data.frame(year=c(1994,1995,1996,1997,1998,1964,1965,1967,1968,1969,1988,1987,1988,1989),
stateA=c(1,1,1,1,1,138,138,138,138,138,20,20,20,20),
stateB=c(2,2,2,2,2,87,87,87,87,87,55,55,55,55),
var1=c(0.101,0.132,0.136,0.136,0.148,-0.287,-0.112,0.088,0.101,0.121,0.387,NA,0.377,0.388)
)
> df
year stateA stateB var1
1 1994 1 2 0.101
2 1995 1 2 0.132
3 1996 1 2 0.136
4 1997 1 2 0.136
5 1998 1 2 0.148
6 1964 138 87 -0.287
7 1965 138 87 -0.112
8 1967 138 87 0.088
9 1968 138 87 0.101
10 1969 138 87 0.121
11 1988 20 55 0.387
12 1987 20 55 NA
13 1988 20 55 0.377
14 1989 20 55 0.388
What I would like to do is treat each country dyad as its own time series and use Holt's model to forecast the following year from the past 5 years of data.
Expected result: I am hoping to add a new variable to the row for year X that contains the forecasted value for year X+1, based on the previous years.
Complications: Not every country dyad exists for every year, and for some years there is no data even though the country dyad exists in the dataset.
What I've done so far:
Forgive me, I have only recently started using time series in R.
First, I used dplyr to sort the data by year (so it is in proper time series order) and then grouped by stateA and stateB:
rolldata <- df %>%
dplyr::arrange(year) %>%
dplyr::group_by(stateA, stateB) %>% [...]
Previously I was computing a 5-year rolling average, which did not fit my analysis needs. This is what it looked like:
rolldata <- df %>%
dplyr::arrange(year) %>%
dplyr::group_by(stateA, stateB) %>%
dplyr::mutate(
point_5a = zoo::rollmean(var1, k = 5, fill = NA, align='right'))
The issue here is that I need to create a time series object for each row to pass to holt(), which should output the forecasted value (fvar). Roughly:
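# STARTYEAR and ROWYEAR are placeholders, not defined variables
dat_ts <- ts(df$var1, start = c(STARTYEAR, 1), end = c(ROWYEAR, 1), frequency = 1)
holt_model <- holt(dat_ts, h = 5)
# $mean holds the point forecasts ($x is just the input series), so $mean[1] is the next-year forecast
fvar[i] <- holt_model$mean[1]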
I hope I have covered the issue in a comprehensible way. Your assistance is most appreciated and I am ready to clarify and answer any questions that might help you to help me.
P.S. Efficiency is not necessary, only results.
EDIT: I do not think I was clear before, but my main goal is to produce a forecast object for each row rather than for the subset as a whole. In my example data for country 1 and country 2: there would be a forecast for 1994 based on a time series of 1994; a forecast for 1995 based on 1994-1995; a forecast for 1996 based on 1994-1996. The same goes for the pair (138, 87), with each row having its own forecast.
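To make the per-row idea concrete, here is a rough sketch of what I am picturing. The helper name holt_next and its arguments window and min_obs are made up for illustration. It also rests on some assumptions: the window counts rows rather than calendar years (so gaps such as the missing 1966 are ignored), NA values are dropped before fitting, and rows with fewer than min_obs usable values get NA, since a trend cannot be estimated from a single point. Any fit that errors also returns NA.

```r
library(forecast)
library(dplyr)

# Example data copied from the question.
df <- data.frame(
  year   = c(1994, 1995, 1996, 1997, 1998, 1964, 1965, 1967, 1968, 1969, 1988, 1987, 1988, 1989),
  stateA = c(1, 1, 1, 1, 1, 138, 138, 138, 138, 138, 20, 20, 20, 20),
  stateB = c(2, 2, 2, 2, 2, 87, 87, 87, 87, 87, 55, 55, 55, 55),
  var1   = c(0.101, 0.132, 0.136, 0.136, 0.148, -0.287, -0.112, 0.088, 0.101, 0.121, 0.387, NA, 0.377, 0.388)
)

# Hypothetical helper: for each position i, fit Holt's method to the last
# `window` values ending at i and return the one-step-ahead forecast.
holt_next <- function(x, window = 5, min_obs = 3) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    w <- x[max(1, i - window + 1):i]
    w <- w[!is.na(w)]              # assumption: missing values are simply dropped
    if (length(w) < min_obs) next  # too few points: leave NA
    out[i] <- tryCatch({
      fit <- suppressWarnings(holt(ts(w), h = 1))
      as.numeric(fit$mean[1])      # point forecast for the next period
    }, error = function(e) NA_real_)
  }
  out
}

# Sort by year within each dyad, then forecast row by row inside each dyad.
rolldata <- df %>%
  dplyr::arrange(stateA, stateB, year) %>%
  dplyr::group_by(stateA, stateB) %>%
  dplyr::mutate(fvar = holt_next(var1)) %>%
  dplyr::ungroup()
```

With window = 5 each forecast uses at most the five most recent rows of the dyad. Passing window = Inf instead would use every earlier row, which is closer to the expanding series described in the EDIT.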