I have a data frame (df) with missing values and want to fill them in by interpolation, subject to a restriction. My data frame is:

X<-c(100,NA,NA,70,NA,NA,NA,NA,NA,NA,35)
Y<-c(10,NA,NA,40,NA,NA,NA,NA,NA,NA,5)
Z<-c(50,NA,NA,20,NA,NA,NA,NA,NA,NA,90)
df<-as.data.frame(cbind(X,Y,Z))
df
     X  Y  Z
1  100 10 50
2   NA NA NA
3   NA NA NA
4   70 40 20
5   NA NA NA
6   NA NA NA
7   NA NA NA
8   NA NA NA
9   NA NA NA
10  NA NA NA
11  35  5 90

I was able to impute missing values from linear interpolation of the known values using:

data.frame(lapply(df, function(X) approxfun(seq_along(X), X)(seq_along(X))))
     X  Y  Z
1  100 10 50
2   90 20 40
3   80 30 30
4   70 40 20
5   65 35 30
6   60 30 40
7   55 25 50
8   50 20 60
9   45 15 70
10  40 10 80
11  35  5 90

My question is: how can I put a constraint on the interpolation? Say, a run of more than 5 consecutive NAs should remain NA and not be filled by linear interpolation, so that my new data frame would look like:

     X  Y  Z
1  100 10 50
2   90 20 40
3   80 30 30
4   70 40 20
5   NA NA NA
6   NA NA NA
7   NA NA NA
8   NA NA NA
9   NA NA NA
10  NA NA NA
11  35  5 90
Filly
  • clarification q: let's say you had 93 rows. The way I read it, the process should halt completely (per column) when it hits the first sequence of 5+ `NA` values? – hrbrmstr Mar 28 '14 at 00:50
  • Not halt actually. If 5+ NAs happen in a sequence, the process should skip them and continue to the next NA/NAs in the same column and try to impute interpolated values provided those NAs are not more than 5 consecutive values. Thanks – Filly Mar 28 '14 at 00:57
  • I think you're stuck with, perhaps, using something like `rle` and `is.na()` to figure out where the skippable sequences are then looping around them, restarting the procedure as you need. I don't think there's an "elegant" way to do it apart from that. – hrbrmstr Mar 28 '14 at 00:59
  • @hrbrmstr I was hacking something together like that earlier, using `cumsum` as well. I think it might be possible to do better than having a loop but it's not pretty :) – TooTone Mar 28 '14 at 01:05
  • @hrbrmstr if you have a solution do you want to post yours, otherwise I'll go ahead... – TooTone Mar 28 '14 at 01:25
  • go for it. wld be interested to see it w/o an arduous loop. – hrbrmstr Mar 28 '14 at 01:46
  • @hrbrmstr weeellll I probably shouldn't have said that... I've just used `sapply` over the RLE vector which is six of one and half a dozen of the other... – TooTone Mar 28 '14 at 01:57
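
To make the `rle` / `is.na()` / `cumsum` idea from the comments concrete, here is a small sketch on column X of the question's data. The names `runs`, `run_end`, `run_start` and `run_table` are made up for illustration; they do not come from the thread.

```r
# Column X from the question
X <- c(100, NA, NA, 70, NA, NA, NA, NA, NA, NA, 35)

# rle() collapses the logical NA mask into runs of identical values
runs <- rle(is.na(X))

# cumsum() of the run lengths gives the last position of each run;
# subtracting the length and adding 1 gives the first position
run_end   <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# One row per run (illustrative helper table)
run_table <- data.frame(start  = run_start,
                        end    = run_end,
                        is_na  = runs$values,
                        length = runs$lengths)
print(run_table)
```

For this column the NA runs cover positions 2 to 3 (length 2) and 5 to 10 (length 6), so only the second one exceeds the limit of 5.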

1 Answer

Here's something that works. It uses is.na to identify NAs, rle to identify runs of NAs, and then cumsum to translate those runs into positions in the vector.

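data.frame(lapply(df, function(X) {
    af <- approxfun(seq_along(X), X)   # linear interpolator through the known values
    rl <- rle(is.na(X))                # runs of NA / non-NA entries
    cu <- cumsum(rl$lengths)           # end position of each run
    L <- 5                             # longest NA run that is still filled in
    unlist(lapply(seq_along(cu), function(i) {
        # NA runs longer than L stay NA; every other run is interpolated
        if (rl$values[i] && rl$lengths[i] > L) rep(NA_real_, rl$lengths[i])
        else af(seq(cu[i] - rl$lengths[i] + 1, cu[i]))
    }))
}))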
TooTone
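
For the loop-free variant the comments were asking about, one possible sketch is to interpolate everything first and then blank out the long NA runs with a mask built from `rle`. The function name `interp_max_gap` and its `max_gap` argument are hypothetical names chosen for this example, and it assumes each column has at least two non-NA values (otherwise `approx` stops with an error).

```r
# Hypothetical helper: linear interpolation that leaves NA runs
# longer than max_gap untouched
interp_max_gap <- function(x, max_gap = 5) {
  idx <- seq_along(x)
  # approx() ignores the NA entries and interpolates between known points;
  # positions outside the known range come back as NA
  filled <- approx(idx, x, xout = idx)$y
  # Expand the per-run test back to one logical value per element
  runs <- rle(is.na(x))
  too_long <- rep(runs$values & runs$lengths > max_gap, runs$lengths)
  filled[too_long] <- NA
  filled
}

X <- c(100, NA, NA, 70, NA, NA, NA, NA, NA, NA, 35)
Y <- c(10, NA, NA, 40, NA, NA, NA, NA, NA, NA, 5)
Z <- c(50, NA, NA, 20, NA, NA, NA, NA, NA, NA, 90)
df <- data.frame(X, Y, Z)

result <- data.frame(lapply(df, interp_max_gap, max_gap = 5))
print(result)
```

On the question's data this fills rows 2 and 3 and leaves rows 5 to 10 as NA. If adding a package is acceptable, `na.approx` in zoo has a `maxgap` argument meant for the same purpose.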