Assume the data look like:

df <- data.frame(ID=1:6, Value=c(NA, 1, NA, NA, 2, NA))
df
  ID Value
1  1    NA
2  2     1
3  3    NA
4  4    NA
5  5     2
6  6    NA

And I want the imputed result to look like:

  ID Value
1  1   1.0
2  2   1.0
3  3   1.5
4  4   1.5
5  5   2.0
6  6   2.0

More specifically, I want to impute each missing value with the mean of the nearest non-missing values before and after it. If only one of those exists, impute with that value. The behavior when all values are missing is not defined.

How can I do that in R?

Bamqf
  • This seems to be what you're looking for: http://stackoverflow.com/questions/15308205/mean-before-after-imputation-in-r – Frank Jun 19 '15 at 18:09
  • imputeTS::na_interpolation (na.interpolation in older versions) and zoo::na.approx might be worth a look; they give a result similar to, though not exactly, the one requested – Steffen Moritz Dec 07 '17 at 14:26

3 Answers

Take a look at the design of approxfun with rule=2. This isn't exactly what you asked for (since it does a linear interpolation across the NA gaps rather than substituting the mean of the gap endpoints), but it might be acceptable:

> approxfun(df$ID, df$Value, rule=2)(df$ID)
[1] 1.000000 1.000000 1.333333 1.666667 2.000000 2.000000

With rule=2 it does behave as you desired at the extremes. There are also na.approx methods in the zoo package.
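
If the mean of the gap endpoints is needed rather than a linear interpolation, approx in base R can be asked for a step function instead. This is a minimal sketch, assuming the df from the question and at least one non-missing value; the variable name imputed is only for illustration.

```r
df <- data.frame(ID = 1:6, Value = c(NA, 1, NA, NA, 2, NA))

# approx() ignores the NA rows and interpolates over the observed points.
# method = "constant" with f = 0.5 returns the average of the left and
# right neighbours inside a gap; rule = 2 uses the nearest observed value
# for points outside the observed range.
imputed <- approx(df$ID, df$Value, xout = df$ID,
                  method = "constant", f = 0.5, rule = 2)$y
imputed
# [1] 1.0 1.0 1.5 1.5 2.0 2.0
```

Observed rows keep their own value, because approx returns the original y wherever xout coincides with an observed x.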

I would caution against using such data for any further statistical inference. This method of imputation is essentially saying there is no possibility of random variation during periods of no measurement, and the world is generally not so consistent.

IRTFM

Use na.locf both forwards and backwards and take their average:

library(zoo)

both <- cbind( na.locf(df$Value, na.rm = FALSE), 
               na.locf(df$Value, na.rm = FALSE, fromLast = TRUE))
transform(df, Value = rowMeans(both, na.rm = TRUE))

giving:

  ID Value
1  1   1.0
2  2   1.0
3  3   1.5
4  4   1.5
5  5   2.0
6  6   2.0
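
For reference, the same forward/backward idea can be sketched in base R without zoo. The helper names fill_forward and impute_neighbor_mean are made up for this illustration and do not come from any package.

```r
# Carry the last non-missing value forward; leading NAs stay NA.
fill_forward <- function(x) {
  # index of the most recent non-missing element (0 if there is none yet)
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  # shift by one so that index 0 maps to the prepended NA
  c(NA, x)[idx + 1L]
}

# Average the forward fill and the backward fill, ignoring a missing side.
impute_neighbor_mean <- function(x) {
  fwd <- fill_forward(x)
  bwd <- rev(fill_forward(rev(x)))
  rowMeans(cbind(fwd, bwd), na.rm = TRUE)
}

df <- data.frame(ID = 1:6, Value = c(NA, 1, NA, NA, 2, NA))
result <- transform(df, Value = impute_neighbor_mean(Value))
result$Value
# [1] 1.0 1.0 1.5 1.5 2.0 2.0
```

If every value is missing, both fills are all NA and rowMeans returns NaN, which matches the question leaving that case undefined.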
G. Grothendieck

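This fills each NA with the running mean of all values before it (including values that were already imputed), so it is a different rule from the one in the question. Note that na.rm = TRUE is required: without it, mean() returns NA because the range always includes the missing value itself, and nothing gets filled in.

for (i in seq_len(nrow(df))) {
    if (is.na(df$Value[i])) {
        # mean of everything up to row i, ignoring missing values;
        # a leading NA becomes NaN because nothing precedes it
        df$Value[i] <- mean(df$Value[1:i], na.rm = TRUE)
    }
}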

I don't know if this is exactly what you want, because I didn't understand this statement: "I want to impute missing data with mean of first previous and latter non missing data, if only one of previous or latter non missing data exist, impute with this non missing data"

What values do you want to find the mean of to replace the NAs?

Buzz Lightyear