
I have a data frame with missing values:

X   Y   Z
54  57  57
100 58  58
NA  NA  NA
NA  NA  NA
NA  NA  NA
60  62  56
NA  NA  NA
NA  NA  NA
69  62  62

I want to fill in the NA values by linear interpolation between the known values, so that the data frame looks like this:

X   Y    Z
54  57  57
100 58  58
90  59  57.5
80  60  57
70  61  56.5
60  62  56
63  62  58
66  62  60
69  62  62

thanks

Josh O'Brien
Filly

2 Answers


Base R's approxfun() returns a function that will linearly interpolate the data it is handed.

## Make easily reproducible data
df <- read.table(text="X   Y   Z
54  57  57
100 58  58
NA  NA  NA
NA  NA  NA
NA  NA  NA
60  62  56
NA  NA  NA
NA  NA  NA
69  62  62", header = TRUE)

## See how this works on a single vector
approxfun(1:9, df$X)(1:9)
# [1]  54 100  90  80  70  60  63  66  69

## Apply interpolation to each of the data.frame's columns
data.frame(lapply(df, function(X) approxfun(seq_along(X), X)(seq_along(X))))
#     X  Y    Z
# 1  54 57 57.0
# 2 100 58 58.0
# 3  90 59 57.5
# 4  80 60 57.0
# 5  70 61 56.5
# 6  60 62 56.0
# 7  63 62 58.0
# 8  66 62 60.0
# 9  69 62 62.0
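
One caveat about the approach above: with its default `rule = 1`, approxfun() returns NA for any point outside the range of the known values, so a column that starts or ends with NA keeps those NAs. A minimal sketch, assuming it is acceptable to carry the nearest known value outward (`rule = 2`); the vector `x` is made up for illustration:

```r
## Hypothetical vector with NAs at both ends
x   <- c(NA, 2, NA, 4, NA)
idx <- seq_along(x)

## Default rule = 1: only interior gaps are filled
default_fill <- approxfun(idx, x)(idx)
default_fill
# [1] NA  2  3  4 NA

## rule = 2: edge values are extended outward as constants
edge_fill <- approxfun(idx, x, rule = 2)(idx)
edge_fill
# [1] 2 2 3 4 4
```

Whether constant extension is appropriate depends on the data; leaving the edge NAs in place is often the more honest choice.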
Josh O'Brien
  • You should probably ask that as a separate question. (It wouldn't be too hard to do using `rle` and `inverse.rle`, except for the way those functions handle NA's, which will necessitate a slightly more complicated approach.) – Josh O'Brien Mar 27 '14 at 21:36
  • How can I put a constraint on the interpolation, then? Say, runs of more than 10 consecutive NAs should remain NA and not be imputed. – Filly Mar 27 '14 at 21:37
  • Thanks, I just asked the question. – Filly Mar 27 '14 at 23:57
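
The constraint raised in these comments (leave long runs of NA untouched) can be sketched in base R by running rle() on `is.na(x)`, which contains no NAs itself. This is only one possible approach, not necessarily the one the commenter had in mind; the helper name `interpolate_short_gaps` and its `max_gap` argument are made up for illustration. It assumes the vector has at least two non-NA values, since approx() needs them.

```r
## Hypothetical helper: interpolate linearly, but keep NA runs longer than max_gap
interpolate_short_gaps <- function(x, max_gap = 10) {
  idx    <- seq_along(x)
  filled <- approx(idx, x, xout = idx)$y   # fills interior NAs; edge NAs stay NA

  ## Run-length encode the NA pattern and flag every position
  ## that belongs to an NA run longer than max_gap
  runs     <- rle(is.na(x))
  too_long <- rep(runs$values & runs$lengths > max_gap, runs$lengths)

  filled[too_long] <- NA
  filled
}

x <- c(1, NA, 3, NA, NA, NA, 7)
short_only <- interpolate_short_gaps(x, max_gap = 2)
short_only
# [1]  1  2  3 NA NA NA  7
```

It can be applied column-wise in the same way as the answer's code, for example `data.frame(lapply(df, interpolate_short_gaps, max_gap = 10))`.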

I can recommend the imputeTS package, which I maintain. It is designed for time series imputation, but it works for this case too.

For this case it would work like this:

library(imputeTS)
df$X <- na_interpolation(df$X, option = "linear")
df$Y <- na_interpolation(df$Y, option = "linear")
df$Z <- na_interpolation(df$Z, option = "linear")

As mentioned, the package expects time series or vector input, which is why each column is handled in a separate call.

The package also offers many other imputation functions, such as spline interpolation.
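
The three per-column calls can also be written as a single lapply() over the columns. A sketch, assuming imputeTS is installed and provides na_interpolation() as used above; the data frame is rebuilt here only so the snippet runs on its own:

```r
library(imputeTS)

## Same data as in the question
df <- data.frame(
  X = c(54, 100, NA, NA, NA, 60, NA, NA, 69),
  Y = c(57, 58, NA, NA, NA, 62, NA, NA, 62),
  Z = c(57, 58, NA, NA, NA, 56, NA, NA, 62)
)

## df[] <- ... replaces every column while keeping the data.frame structure
df[] <- lapply(df, na_interpolation, option = "linear")
df
```

Passing each column as a plain vector keeps to the vector input described above, so this does not rely on any data-frame handling inside the package.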

Steffen Moritz
  • Welcome to Stack Overflow! You've posted several answers in quick succession, all recommending the imputeTS package. Maybe you're just a big fan, but if you are more than that you should report any affiliation within the answers themselves. You might want to read [How not to be a spammer (aka how not to appear as one)](http://stackoverflow.com/help/promotion) in the help pages. – Mogsdad May 14 '16 at 03:36