
I have a time series (wave height data) with gaps that I need to fill by interpolation. I found the na.approx function in the zoo package for this, but I haven't found a way to make it account for a missing observation being much closer in time to one neighbouring observation than to the other.

Example:

library(zoo)
test = data.frame(Wave_Height = c(1.2, NA, 0.5), Data = 
    as.POSIXct(c("2019-01-01 00:00", "2019-01-01 05:00", "2019-01-01 06:00"), 
      format = "%Y-%m-%d %H:%M"))
> test
  Wave_Height                Data
1         1.2 2019-01-01 00:00:00
2          NA 2019-01-01 05:00:00
3         0.5 2019-01-01 06:00:00

test$Wave_Height = na.approx(test$Wave_Height)
> test
  Wave_Height                Data
1        1.20 2019-01-01 00:00:00
2        0.85 2019-01-01 05:00:00
3        0.50 2019-01-01 06:00:00

I feel like there should be a weight parameter somewhere, but scanning through the documentation I haven't been able to find it. I'm looking for a result like this:

> test
  Wave_Height                Data
1        1.20 2019-01-01 00:00:00
2        0.62 2019-01-01 05:00:00
3        0.50 2019-01-01 06:00:00
Luis
  • As far as I know, zoo can also interpolate irregularly spaced time series. I think you have to create a zoo time series first (so that zoo recognizes the timestamps correctly) and then call na.approx. So create a zoo series from your data.frame and try again. – Steffen Moritz Jun 21 '19 at 01:42
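The suggestion in the comment above can be sketched as follows. This is a minimal illustration, not a confirmed answer from the thread: it reuses the question's `test` data, and the `tz = "UTC"` argument and the `na.rm = FALSE` setting are assumptions added here (the first to avoid daylight-saving surprises, the second so the result keeps the same length as the input).

```r
library(zoo)

# Same data as in the question; tz = "UTC" is an assumption made for this sketch
test <- data.frame(
  Wave_Height = c(1.2, NA, 0.5),
  Data = as.POSIXct(c("2019-01-01 00:00", "2019-01-01 05:00", "2019-01-01 06:00"),
                    format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# Index the series by its timestamps so that na.approx() interpolates
# along time instead of along row position
z <- zoo(test$Wave_Height, order.by = test$Data)

# na.rm = FALSE keeps leading/trailing NAs, so the length stays unchanged
filled <- na.approx(z, na.rm = FALSE)

test$Wave_Height <- as.numeric(coredata(filled))
round(test$Wave_Height, 2)
# 1.20 0.62 0.50
```

Because the gap sits 5 hours after the first observation and 1 hour before the last, the interpolated value is 1.2 + (0.5 - 1.2) * 5/6, which rounds to the 0.62 asked for in the question.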

1 Answer


Maybe you could use a simple linear regression?

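# Fit a linear model on the rows where the wave height is known
mod <- lm(Wave_Height ~ Data, test[complete.cases(test), ])
# Predict the missing wave heights from their timestamps
test$Wave_Height[is.na(test$Wave_Height)] <- predict(mod, newdata = test[!complete.cases(test), ])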

Here is a solution using a generalized additive model and therefore not assuming linearity of the relationship:

library(mgcv)
mod <- gam(Wave_Height ~ s(Data), data=test[complete.cases(test), ])
test$Wave_Height[is.na(test$Wave_Height)] <- predict(mod, newdata = test[!complete.cases(test), ])

But you'll need to change the format of your data a little bit (see here) and might want to adapt the model specification...

jkd
  • Thanks, but I think I'm looking for a solution to interpolate the two values to fill in my NAs, not to model the relationship between the variables. – Luis Jun 19 '19 at 18:44
  • This is exactly what the model does, it interpolates the missing values by assuming a linear relationship between them. If you check it out, you'd find that this gives 0.62 for the missing `Wave_Height`, exactly as you wanted. So in fact you already assumed the linear relationship... – jkd Jun 19 '19 at 18:55
  • I understand, and yes, for N = 2, the solution works. Problem is, my data is much bigger than this, and it's also cyclical (daily). I put in just 2 observations in my example for convenience. – Luis Jun 19 '19 at 18:58
  • Anyway you have to assume some kind of relationship between your variables if there is a "correct" way to interpolate missing values (because 0.85 was wrong in your eyes and 0.62 was right). So if the relationship is not linear you might use a generalized additive model instead, but using a model is the only solution if you want to interpolate following a "rule" (which would be the relationship you assume between your variables). – jkd Jun 19 '19 at 19:04
  • Also you could iterate over your data day by day using this linear model solution. Maybe you should just give it a try ;) – jkd Jun 19 '19 at 19:06
  • Otherwise plot your data so that we can see what kind of relationship would be most appropriate. – jkd Jun 19 '19 at 19:07
  • I'm fine with assuming a linear relationship. I have 1477 (known) observations in my original database, so I suppose I could run 1476 regressions to fill in the NAs between them, but I assume there's an easier way to do this. – Luis Jun 19 '19 at 19:26
  • How many regressions you need depends on the time frame during which you want to assume linearity. If you assume linearity over one day, you need one regression per day, not per observation. – jkd Jun 19 '19 at 20:16
  • I want to assume linearity between every two consecutive observations. – Luis Jun 19 '19 at 20:19
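Assuming linearity between every pair of neighbouring known observations is what base R's `approx()` does in a single call, so no per-pair regressions are needed. The sketch below is illustrative only: `fill_linear` is a hypothetical helper name, and the five wave heights are made-up values extending the question's example, not real data.

```r
# Hypothetical helper: fill NAs in y by linear interpolation along x
fill_linear <- function(x, y) {
  known <- !is.na(y)
  # approx() joins each pair of neighbouring known points with a straight
  # line and evaluates those segments at every timestamp in x
  approx(x = as.numeric(x[known]), y = y[known], xout = as.numeric(x))$y
}

# Made-up illustration data: observations at hours 0, 5, 6, 7 and 10
times  <- as.POSIXct("2019-01-01 00:00", tz = "UTC") + 3600 * c(0, 5, 6, 7, 10)
height <- c(1.2, NA, 0.5, NA, 0.8)

filled <- fill_linear(times, height)
round(filled, 3)
# 1.200 0.617 0.500 0.575 0.800
```

Each gap is filled from its own two neighbours only, weighted by the time distance to each. With the default `rule = 1`, NAs before the first or after the last known observation stay NA.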