na.approx and na.locf not behaving properly

Question

I'm trying to calculate imputed values for a time series for different countries. This piece of code worked fine before, but now the imputed values are all wrong. I can't figure out the problem, and I've tried everything I could think of.

Our rules are:

Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.
If values are missing in the middle of a time series, linear interpolation is used.

# load libraries for data manipulation and imputation
library(dplyr)
library(tidyr)
library(zoo)

# expand table to show NAs
output_table_imp = expand(output_table, transport_mode, year, country_code)
output_table_imp = full_join(output_table_imp, output_table)

# add imputed values
output_table_imp <- output_table_imp %>%
  group_by(transport_mode, country_code) %>%
  mutate(fatalities_imp = na.approx(fatalities, na.rm = FALSE)) %>%   # linear interpolation
  mutate(fatalities_imp = na.locf(fatalities_imp, na.rm = FALSE)) %>% # missing values at the end of a time series (copy last non-NA value)
  mutate(fatalities_imp = na.locf(fatalities_imp, fromLast = TRUE, na.rm = FALSE)) # missing values at the start of a time series (copy first non-NA value)
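
As a sanity check of the three steps in isolation, here is a minimal sketch on a made-up vector (not the real data; the names x, step1, step2, step3 and one_call are purely illustrative). It also shows a one-call alternative, relying on na.approx passing rule = 2 on to stats::approx so that leading and trailing gaps take the nearest known value. Note that on a plain vector na.approx interpolates by position, so the rows of each group are assumed to be sorted by year.

```r
library(zoo)

# Made-up series with gaps at the start, in the middle, and at the end
x <- c(NA, NA, 1, 2, 3, NA, NA, 6, 7, 8, NA, NA)

# Step 1: linear interpolation of interior gaps only (edge NAs are kept)
step1 <- na.approx(x, na.rm = FALSE)

# Step 2: carry the last known value forward to fill the trailing gap
step2 <- na.locf(step1, na.rm = FALSE)

# Step 3: carry the first known value backward to fill the leading gap
step3 <- na.locf(step2, fromLast = TRUE, na.rm = FALSE)

# One-call alternative: rule = 2 is handed to stats::approx
one_call <- na.approx(x, rule = 2)

print(step3)
print(one_call)
```

If both results match the three rules on the toy vector but the real table still looks wrong, the row order within each group and the grouping columns would be the next things to inspect.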

My data frame consists of a couple of columns: transport_mode, country_code, year, fatalities. I'm not sure how I can share my data here? It's a large table with 3600 observations ...

These are the original numbers:

And these are the imputed values. You can see straight away that there is a problem for CY, IE and LT.

The data frame looks like this:

score 1 · Answer 1 · answered Jul 06 '21 at 00:16

Your code looks somewhat overly complicated. I don't know the zoo details, but I'm pretty sure you could get it to work with zoo as well.

With the imputeTS package you could just pass your whole data.frame (it assumes each column is a separate time series), and the package performs imputation for each of these series. (Unfortunately your question includes no data, but I guess this would be your output_table_imp data.frame after expansion.)

Just like this:

library("imputeTS")
na_interpolation(output_table_imp, option = "linear")

You also don't have to change anything for the NA treatment at the beginning and at the end, since your requirements are the default behavior of the na_interpolation function.

These were your requirements:

Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.

Here is a toy example:

# Test time series with NAs at start, middle, end
test <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)

# Perform linear interpolation
na_interpolation(test, option = "linear")

# Result
# 1 1 1 2 3 4 5 6 7 8 8 8

As you can see, this gives exactly the required result.

It also works with a data.frame (as I said, each column is interpreted as a time series):

# Create three time series and combine them into 1 data.frame
ts1 <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
ts2 <- c(NA,1,1,2,3,NA,3,6,7,8,NA,NA)
ts3 <- c(NA,3,1,2,3,NA,3,6,7,8,NA,NA)
df <- data.frame(ts1,ts2,ts3)

na_interpolation(df, option = "linear")
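
For data in long format (one row per country and year rather than one column per series), the column-wise call above would not separate the series. A minimal sketch of a grouped variant follows, assuming dplyr is available and using a made-up data frame toy with hypothetical country codes "AA" and "BB". As far as I know, na_interpolation needs at least two non-missing values per series, so groups with fewer known values would need separate handling; treat that as an assumption to check against the package documentation.

```r
library(dplyr)
library(imputeTS)

# Hypothetical long-format data: one row per country and year
toy <- data.frame(
  country_code = rep(c("AA", "BB"), each = 6),
  year         = rep(2010:2015, times = 2),
  fatalities   = c(NA, 10, NA, 14, 16, NA,
                   5, NA, NA, 11, NA, NA)
)

toy_imp <- toy %>%
  arrange(country_code, year) %>%   # make sure each series is in time order
  group_by(country_code) %>%        # treat each country as its own series
  mutate(fatalities_imp = na_interpolation(fatalities, option = "linear")) %>%
  ungroup()

print(toy_imp)
```

With the real table, the grouping would presumably be group_by(transport_mode, country_code), mirroring the code in the question.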

Thank you for your suggestion. I have added a picture to show what the data frame looks like. It's not shaped like an actual time series ... But I think I can work with the imputeTS package, I'll give it a go later this week — Freya Slootmans, Jul 06 '21 at 07:49
