Constraining data imputation in R

Question

I have a data frame (df) with missing values and want to impute interpolated values, subject to a restriction. My data frame is:

X<-c(100,NA,NA,70,NA,NA,NA,NA,NA,NA,35)
Y<-c(10,NA,NA,40,NA,NA,NA,NA,NA,NA,5)
Z<-c(50,NA,NA,20,NA,NA,NA,NA,NA,NA,90)
df<-as.data.frame(cbind(X,Y,Z))
df
     X  Y  Z
1  100 10 50
2   NA NA NA
3   NA NA NA
4   70 40 20
5   NA NA NA
6   NA NA NA
7   NA NA NA
8   NA NA NA
9   NA NA NA
10  NA NA NA
11  35  5 90

I was able to impute the missing values by linear interpolation of the known values using:

 data.frame(lapply(df, function(X) approxfun(seq_along(X), X)(seq_along(X))))
     X  Y  Z
1  100 10 50
2   90 20 40
3   80 30 30
4   70 40 20
5   65 35 30
6   60 30 40
7   55 25 50
8   50 20 60
9   45 15 70
10  40 10 80
11  35  5 90

My question is: how can I constrain the interpolation? Say, runs of more than 5 consecutive NAs should remain NA and not be imputed by linear interpolation, so that my new data frame would look like:

    X  Y  Z
1  100 10 50
2   90 20 40
3   80 30 30
4   70 40 20
5   NA NA NA
6   NA NA NA
7   NA NA NA
8   NA NA NA
9   NA NA NA
10  NA NA NA
11  35  5 90

clarification q: let's say you had 93 rows. The way I read it, the process should halt completely (per column) when it hits the first sequence of 5+ `NA` values? — hrbrmstr, Mar 28 '14 at 00:50
Not halt actually. If 5+ NAs happen in a sequence, the process should skip them and continue to the next NA/NAs in the same column and try to impute interpolated values provided those NAs are not more than 5 consecutive values. Thanks — Filly, Mar 28 '14 at 00:57
I think you're stuck with, perhaps, using something like `rle` and `is.na()` to figure out where the skippable sequences are then looping around them, restarting the procedure as you need. I don't think there's an "elegant" way to do it apart from that. — hrbrmstr, Mar 28 '14 at 00:59
@hrbrmstr I was hacking something together like that earlier, using `cumsum` as well. I think it might be possible to do better than having a loop but it's not pretty :) — TooTone, Mar 28 '14 at 01:05
@hrbrmstr if you have a solution do you want to post yours, otherwise I'll go ahead... — TooTone, Mar 28 '14 at 01:25
@hrbrmstr weeellll I probably shouldn't have said that... I've just used `sapply` over the RLE vector which is six of one and half a dozen of the other... — TooTone, Mar 28 '14 at 01:57

score 3 · Accepted Answer · answered Mar 28 '14 at 01:56

Here's something that works. It uses is.na to identify NAs, rle to identify runs of NAs, and then cumsum to translate those runs into positions in the vector.

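data.frame(lapply(df, function(X) {
    af <- approxfun(seq_along(X), X)   # linear interpolant over the known values
    rl <- rle(is.na(X))                # runs of NA / non-NA entries
    cu <- cumsum(rl$lengths)           # end position of each run
    L <- 5                             # longest NA run that is still imputed
    unlist(lapply(seq_along(cu), function(i) {
        if (rl$values[i] && rl$lengths[i] > L) rep(NA_real_, rl$lengths[i])  # leave long gaps as NA
        else af(seq(cu[i] - rl$lengths[i] + 1, cu[i]))  # interpolate this run's positions
    }))
}))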
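
For reference, here is a minimal base-R sketch of the same idea without the inner loop: interpolate everything first, then blank out the positions that belong to over-long NA runs. The helper name `interp_max_gap` and its `max_gap` argument are made up for illustration and are not part of the answer above. It assumes each column has at least two non-NA values, since `approx` needs two points to interpolate.

```r
# Hypothetical helper (name and arguments are illustrative assumptions):
# linearly interpolate a numeric vector, but leave any run of more than
# `max_gap` consecutive NAs untouched.
interp_max_gap <- function(x, max_gap = 5) {
  idx <- seq_along(x)
  # approx() drops NA pairs by default, so this interpolates between known values
  filled <- approx(idx, x, xout = idx)$y
  runs <- rle(is.na(x))                            # runs of NA / non-NA entries
  too_long <- runs$values & runs$lengths > max_gap # NA runs longer than max_gap
  # expand the per-run flag back to one flag per element and reset those to NA
  filled[rep(too_long, runs$lengths)] <- NA
  filled
}

X <- c(100, NA, NA, 70, NA, NA, NA, NA, NA, NA, 35)

runs <- rle(is.na(X))
runs$lengths          # 1 2 1 6 1
runs$values           # FALSE TRUE FALSE TRUE FALSE
cumsum(runs$lengths)  # 1 3 4 10 11, the last position of each run

result <- interp_max_gap(X)
result                # 100 90 80 70 NA NA NA NA NA NA 35
```

Applied per column, this would be `data.frame(lapply(df, interp_max_gap))`. The run of 2 NAs is filled, while the run of 6 exceeds the limit of 5 and stays NA.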
