
Searching for linear interpolation of time series data in R, I often found recommendations to use na.approx() from the zoo package.

However, with irregular time series I ran into problems: na.approx() interpolates by position in the vector, so the interpolated values are spaced evenly across the gap without taking the associated timestamps into account.

I found a workaround using approxfun(), but I wonder whether there is a cleaner solution, ideally based on tsibble objects with functions from the tidyverts package family?

Previous answers relied on expanding the irregular date grid to a regular grid by filling the gaps. However, this causes problems when the time of day should be taken into account during interpolation.

Here is a (revised) minimal example with POSIXct timestamps rather than Dates only:

library(tidyverse)
library(zoo)

df <- tibble(date = as.POSIXct(c("2000-01-01 00:00", "2000-01-02 02:00", "2000-01-05 00:00")),
             value = c(1,NA,2))

df %>% 
  mutate(value_int_wrong = na.approx(value),
         value_int_correct = approxfun(date, value)(date))

# A tibble: 3 x 4
  date                value value_int_wrong value_int_correct
  <dttm>              <dbl>           <dbl>             <dbl>
1 2000-01-01 00:00:00     1             1                1   
2 2000-01-02 02:00:00    NA             1.5              1.27
3 2000-01-05 00:00:00     2             2                2   
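
As a cross-check on the two columns above, here is a minimal base R sketch (not part of the original example) that reproduces both numbers with approx() alone. The tz = "UTC" argument and the variable names are assumptions added for illustration. Interpolating over row positions gives 1.5, while weighting by elapsed time (26 of 96 hours) gives about 1.27.

```r
# Timestamps and values from the example above. tz = "UTC" is an assumption
# added here so the result does not depend on the local time zone.
date  <- as.POSIXct(c("2000-01-01 00:00", "2000-01-02 02:00", "2000-01-05 00:00"),
                    tz = "UTC")
value <- c(1, NA, 2)

# Position-based: x is just 1, 2, 3, so the NA sits halfway between neighbours.
by_position <- approx(seq_along(value), value, xout = seq_along(value))$y

# Time-based: x is the timestamp in seconds, so the NA is weighted by elapsed time.
by_time <- approx(as.numeric(date), value, xout = as.numeric(date))$y

# The same time-based value computed by hand.
elapsed <- as.numeric(difftime(date[2], date[1], units = "hours"))  # 26 hours
total   <- as.numeric(difftime(date[3], date[1], units = "hours"))  # 96 hours
by_hand <- value[1] + (value[3] - value[1]) * elapsed / total

print(by_position)
print(by_time)
print(by_hand)
```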

Any ideas how to (efficiently) deal with this? Thanks for your support!

  • Hi Jens, have you found a satisfying solution for your problem yet? I'd be interested. – mabe Aug 13 '20 at 08:10

2 Answers


Here is an equivalent tsibble-based solution. The interpolate() function needs a model, but you can use a random walk to give linear interpolation between points.

library(tidyverse)
library(tsibble)
library(fable)
#> Loading required package: fabletools

df <- tibble(
  date = as.Date(c("2000-01-01", "2000-01-02", "2000-01-05", "2000-01-06")),
  value = c(1, NA, 2, 1.5)
) %>%
  as_tsibble(index = date) %>%
  fill_gaps()

df %>%
  model(naive = ARIMA(value ~ -1 + pdq(0,1,0) + PDQ(0,0,0))) %>%
  interpolate(df)
#> # A tsibble: 6 x 2 [1D]
#>   date       value
#>   <date>     <dbl>
#> 1 2000-01-01  1   
#> 2 2000-01-02  1.25
#> 3 2000-01-03  1.5 
#> 4 2000-01-04  1.75
#> 5 2000-01-05  2   
#> 6 2000-01-06  1.5

Created on 2020-04-08 by the reprex package (v0.3.0)
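
Why a random walk yields linear interpolation can be sketched as follows. This derivation is an addition, not taken from the answer itself, and it assumes Gaussian increments with constant variance on the regular daily grid produced by fill_gaps(). Under those assumptions, the expected value of a missing point between two observed points lies on the straight line joining them:

```latex
y_t = y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)

\mathbb{E}\left[\, y_t \mid y_s, y_u \,\right]
  = y_s + \frac{t - s}{u - s}\,(y_u - y_s), \qquad s < t < u
```

With y_s = 1 on 2000-01-01 and y_u = 2 on 2000-01-05, the first missing day gets 1 + 1/4 = 1.25, which matches the output above.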

Rob Hyndman
  • Hi Rob, thank you very much for your answer. I hoped you would take a look! I had to revise my minimal example, because in reality I deal with time series that also resolve the time of day. I tried to run your code on my revised example data set, but this caused an error message ("Could not find an appropriate ARIMA model. This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable."). Can your solution be adapted to POSIXct? Thanks for sharing your expertise! – Jens Daniel Müller Apr 08 '20 at 07:57
  • I updated my answer to be more specific in case the POSIXct was confusing it into picking a seasonal model. If it still causes an error, can you please post a bug report with a reproducible example at https://github.com/tidyverts/fable/issues – Rob Hyndman Apr 08 '20 at 09:21
  • Hi Rob, thanks again, but it does not seem to run with my minimal example. I opened an issue at https://github.com/tidyverts/fable/issues/256 – Jens Daniel Müller Apr 08 '20 at 12:44

Personally, I would go with the solution that you are already using. But to show how na.approx can be used in this case: we can complete the sequence of dates before calling na.approx, then join the result with the original df to keep only the original rows.

library(dplyr)

df %>% 
  tidyr::complete(date = seq(min(date), max(date), by = "day")) %>%
  mutate(value_int = zoo::na.approx(value)) %>%
  right_join(df, by = "date") %>%
  select(date, value_int)


#  date       value_int
#  <date>         <dbl>
#1 2000-01-01      1   
#2 2000-01-02      1.25
#3 2000-01-05      2   
Ronak Shah
  • Hi Ronak, thanks for your immediate answer. I'm afraid that your proposed solution will not work well when the date vector has a high temporal resolution. I did not cover this in my minimal example, but the environmental time series I usually work with have a resolution of seconds, yet measurements only every couple of days. – Jens Daniel Müller Apr 07 '20 at 11:42
  • Well, it might be inefficient but I think it should still work. – Ronak Shah Apr 07 '20 at 11:50
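
On the efficiency point raised in these comments, one way to avoid building a regular grid at all is to interpolate directly on the observed timestamps. The following is only a sketch in base R: interpolate_by_time() is a hypothetical helper name, not a function from zoo or tsibble, and the three-row series is made up to mimic second-resolution timestamps with observations days apart.

```r
# Hypothetical helper (not from any package): fill NAs in `value` by linear
# interpolation over the actual timestamps in `time`.
interpolate_by_time <- function(time, value) {
  ok <- !is.na(value)
  # Nothing to fill, or too few known points for approx(): return unchanged.
  if (all(ok) || sum(ok) < 2) return(value)
  filled <- value
  filled[!ok] <- approx(
    x    = as.numeric(time[ok]),   # seconds since the epoch for POSIXct input
    y    = value[ok],
    xout = as.numeric(time[!ok])   # only the missing positions are evaluated
  )$y                              # NAs outside the observed range stay NA
  filled
}

# Illustrative data: timestamps given to the second, observations days apart.
start <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
time  <- start + c(0, 93600, 345600)   # 0 h, 26 h and 96 h after start
value <- c(1, NA, 2)

filled <- interpolate_by_time(time, value)
print(filled)

# Rows a one-second grid over the same span would need, for comparison.
grid_rows <- as.numeric(difftime(max(time), min(time), units = "secs")) + 1
print(grid_rows)
```

The helper only ever touches the three observed rows, whereas a one-second grid over the same four days would need several hundred thousand rows. In a dplyr pipeline it could presumably be called as mutate(value_int = interpolate_by_time(date, value)), optionally after group_by().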