I'm trying to calculate imputed values for time series for different countries. This piece of code worked fine before, but now the imputed values are all wrong. I can't figure out the problem, and I've tried everything I could think of.
Our rules are:
- Values missing at the end of a time series are given the last known value in the series.
- Values missing at the beginning of a time series are given the first known value in the series.
- If values are missing in the middle of a time series, linear interpolation between the neighbouring known values is used.
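To make the three rules concrete, here is a minimal sketch of what I expect them to produce on a single series. The vector `x` and its values are made up for illustration; only `na.approx()` and `na.locf()` from zoo are used.

```r
library(zoo)

# Made-up series: NAs at the start, in the middle, and at the end
x <- c(NA, NA, 10, NA, NA, 40, NA)

# Rule 3: linear interpolation fills interior gaps only;
# na.rm = FALSE keeps the leading and trailing NAs in place
step1 <- na.approx(x, na.rm = FALSE)

# Rule 1: carry the last known value forward to the end of the series
step2 <- na.locf(step1, na.rm = FALSE)

# Rule 2: carry the first known value backward to the start of the series
step3 <- na.locf(step2, fromLast = TRUE, na.rm = FALSE)

print(step3)
# 10 10 10 20 30 40 40
```

Note that on a plain vector `na.approx()` interpolates by position, so this only gives the intended result when there is exactly one element per year and the elements are in year order.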
# load libraries for reshaping (dplyr, tidyr) and imputation (zoo)
library(dplyr)
library(tidyr)
library(zoo)
# expand table to show NAs: one row per transport_mode / year / country_code combination
output_table_imp <- expand(output_table, transport_mode, year, country_code)
output_table_imp <- full_join(output_table_imp, output_table, by = c("transport_mode", "year", "country_code"))
# add imputed values, one time series per transport mode and country
output_table_imp <- output_table_imp %>%
  group_by(transport_mode, country_code) %>%
  mutate(fatalities_imp = na.approx(fatalities, na.rm = FALSE)) %>% # linear interpolation of gaps in the middle
  mutate(fatalities_imp = na.locf(fatalities_imp, na.rm = FALSE)) %>% # missing values at the end of a time series (copy last non-NA value)
  mutate(fatalities_imp = na.locf(fatalities_imp, fromLast = TRUE, na.rm = FALSE)) # missing values at the start of a time series (copy first non-NA value)
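As a sanity check, the same steps can be run on a tiny table where the expected result is easy to verify by hand. This is only a sketch: the `toy` data are hypothetical (not my real table), and it uses `complete()` from tidyr instead of `expand()` plus `full_join()`, with an explicit `arrange()` so that each group is in year order before the positional interpolation runs.

```r
library(dplyr)
library(tidyr)
library(zoo)

# Hypothetical toy data with gaps at the start, middle, and end
toy <- tibble(
  transport_mode = "car",
  country_code   = c("AA", "AA", "AA", "BB", "BB"),
  year           = c(2001, 2003, 2004, 2000, 2002),
  fatalities     = c(10, 30, 40, 5, 9)
)

toy_imp <- toy %>%
  # add the missing year rows (fatalities = NA) for every mode/country
  complete(transport_mode, country_code, year) %>%
  # na.approx() works by row position, so sort each series by year first
  arrange(transport_mode, country_code, year) %>%
  group_by(transport_mode, country_code) %>%
  mutate(
    fatalities_imp = na.approx(fatalities, na.rm = FALSE),
    fatalities_imp = na.locf(fatalities_imp, na.rm = FALSE),
    fatalities_imp = na.locf(fatalities_imp, fromLast = TRUE, na.rm = FALSE)
  ) %>%
  ungroup()

print(toy_imp)
```

For this toy table I would expect AA to come out as 10, 10, 20, 30, 40 and BB as 5, 7, 9, 9, 9 for the years 2000 to 2004. If my real data gives something different for a country, comparing it against a small case like this should help narrow down where the steps diverge.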
My data frame has four columns: transport_mode, country_code, year, fatalities. I'm not sure how best to share my data here, since it's a large table with 3600 observations.
These are the original numbers:
And these are the imputed values. You can see straight away that there is a problem for CY, IE and LT.