1

I have a panel with missing values that I need to interpolate.

a <- data.frame(id= c(1,1,1,1,1,1,1,2,2,2,2,2,2,2), year=1:7, index=c(1,NA,NA,NA,3,NA,NA, 2,NA,NA,NA,5,NA,NA))

The issue is that I don't have the final value, so I would like to use the same interpolation line to predict values above the last observation reported. Additionally, the interpolation should be "by id", rather than considering the first observation of the next id as part of the line.

I've tried the basic interpolation but it does stop at the latest values, or predicts using the next id:

a <- na.approx(a)

output I get:

       id year    index
 [1,]  1    1 1.000000
 [2,]  1    2 1.500000
 [3,]  1    3 2.000000
 [4,]  1    4 2.500000
 [5,]  1    5 3.000000
 [6,]  1    6 2.666667
 [7,]  1    7 2.333333
 [8,]  2    1 2.000000
 [9,]  2    2 2.750000
[10,]  2    3 3.500000
[11,]  2    4 4.250000
[12,]  2    5 5.000000
[13,]  2    6       NA
[14,]  2    7       NA

desired output:

        id year    index
 [1,]  1    1 1.000000
 [2,]  1    2 1.500000
 [3,]  1    3 2.000000
 [4,]  1    4 2.500000
 [5,]  1    5 3.000000
 [6,]  1    6 3.500000
 [7,]  1    7 4.000000
 [8,]  2    1 2.000000
 [9,]  2    2 2.750000
[10,]  2    3 3.500000
[11,]  2    4 4.250000
[12,]  2    5 5.000000
[13,]  2    6 5.750000
[14,]  2    7 6.500000
    
Antonio
  • 158
  • 1
  • 14
  • I'm not sure that was clear, but with "same interpolation line" I meant I would like to keep predicting NA assuming the slope to be the same after the last observed point – Antonio Nov 06 '20 at 14:50
  • i get this error: 0 (non-NA) cases – Antonio Nov 06 '20 at 15:42

4 Answers4

3

With lm you can get the slope that this simple interpolation is using, and then use that slope to generate new values with predict. But maybe there's a simpler solution

mod <- lm(index ~ year, a)

a[,2] <- predict(mod, newdata=data.frame(year=a$year))

EDIT 1

No, for each id we will run a different lm. To do that we select the part of a with a unique id inside a loop, and run the lm only with that part:

for(i in unique(a$id)){
  ai = a[a$id==i,]
  mod = lm(index ~ year, ai)
  a[a$id==i,3] = predict(mod, newdata=data.frame(year=ai$year))}
  • Thank you for your answer, I was not general enough in my question. I edited it, sorry – Antonio Nov 06 '20 at 15:16
  • Ok this is smart. I think it should work but I get this error now: 0 (non-NA) cases – Antonio Nov 06 '20 at 15:45
  • But it does work nonetheless, so that's good! thanks! – Antonio Nov 06 '20 at 15:49
  • 1
    My guess is that this error can happen if an `id` is full of NA's, so you can't run a regression without observations. To help you with how to work that out i'd need more info on the context of what you're doing. You could just accept that error, and ignore that specific id, or use the mean of the intercepts and slopes for the other `id`'s for example, but that's arbitrary. You're welcome! – Ricardo Semião e Castro Nov 06 '20 at 16:04
  • That's fine, I'll drop those ids. On a different note: I tried putting the predict function inside a ceiling() one to get only integer predictions but it's giving me weird results (always increasing even if the slope is flat). Any idea why? – Antonio Nov 06 '20 at 16:12
  • 1
    By "flat slope" you mean slope coef. = 0? If yes then I don't know why is that happening... Maybe try `trunc` or `round` to see if the problem remains. – Ricardo Semião e Castro Nov 06 '20 at 16:23
  • 1
    Yes, I meant coef=0, actually using round() solved it. thanks again – Antonio Nov 06 '20 at 16:31
2

Assuming the following data which is modified from the question in light of the comments, we define a function which returns NA if all index values are NA and na.spline otherwise and then apply it per id:

library(zoo)

a <- data.frame(id= c(1,1,1,1,1,1,1,2,2,2,2,2,2,2), year=1:7, 
  index=c(NA,NA,NA,NA,NA,NA,NA, 2,NA,NA,NA,5,NA,NA))

na_spline <- function(x) if (all(is.na(x))) NA else na.spline(x)
transform(a, index = ave(index, id, FUN = na_spline))
##    id year index
## 1   1    1    NA
## 2   1    2    NA
## 3   1    3    NA
## 4   1    4    NA
## 5   1    5    NA
## 6   1    6    NA
## 7   1    7    NA
## 8   2    1  2.00
## 9   2    2  2.75
## 10  2    3  3.50
## 11  2    4  4.25
## 12  2    5  5.00
## 13  2    6  5.75
## 14  2    7  6.50
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • I'm sorry, I was not general enough in my first post, I edited it – Antonio Nov 06 '20 at 15:24
  • It's weird, I keep getting this error: Error in splinefun(x[!na], y[!na], ...) : zero non-NA points – Antonio Nov 06 '20 at 15:51
  • maybe the error comes from ids for which all values of index are missing. while @ricardo's answer goes around that by getting the error only for the "empty" ids while computing the remaining ones where at least 2 values of "index" are there – Antonio Nov 06 '20 at 15:53
  • I'm sorry i meant there might be ids for which all values of index are missing. I had not realized that – Antonio Nov 06 '20 at 15:54
  • Your answer is more cleaner than ricardo's one, but looping through ids he solves the missing index problem. thanks a lot anyway – Antonio Nov 06 '20 at 15:59
  • As I said in 2 previous comments, I don't have any missing ids, I have ids for which all values of index are missing. I really appreciate your help but please read carefully my comments so you don't have to update answers for no reason – Antonio Nov 06 '20 at 16:10
  • that's a good solution. thank you and sorry for the confusion – Antonio Nov 06 '20 at 16:28
  • Have revised answer. – G. Grothendieck Nov 06 '20 at 17:53
0

Does this work:

library(dplyr)
library(tidyr)
library(purrr)
a %>% mutate(index = replace_na(index, 0)) %>% 
      mutate(index = accumulate(index, ~ .5 + .x))
  year index
1    1   1.0
2    2   1.5
3    3   2.0
4    4   2.5
5    5   3.0
6    6   3.5
7    7   4.0
> 
Karthik S
  • 11,348
  • 2
  • 11
  • 25
0

It looks like the values follow a linear pattern. You could then use fit=lm(index~year, data=a) and then predict the values using a$index=fit$coef[2]*a$year+fit$coef[1].

Michal J Figurski
  • 1,262
  • 1
  • 11
  • 18