
I am currently trying to impute values in a vector in R. The conditions for the imputation are:

  • Find all NA values
  • Then check if they have an existing value before and after them
  • Also check if the value which follows the NA is larger than the value before the NA
  • If the conditions are met, calculate a mean taking the values before and after.
  • Replace the NA value with the imputed one
# example one
input_one = c(1,NA,3,4,NA,6,NA,NA)

# example two
input_two = c(NA,NA,3,4,5,6,NA,NA)

# example three
input_three = c(NA,NA,3,4,NA,6,NA,NA)

I started writing code to detect the values that can be imputed, but I got stuck with the following.

# incomplete function to detect the values
sapply(split(!is.na(input[c(rbind(which(is.na(c(input)))-1, which(is.na(c(input)))+1))]), 
             rep(1:(length(!is.na(input[c(which(is.na(c(input)))-1, which(is.na(c(input)))+1)]))/2), each = 2)), all)

However, this only detects the NAs that might be imputable, and it only works with example one. It is incomplete and unfortunately very hard to read and understand.

Any help with this would be highly appreciated.

Steffen Moritz
  • Can you show the expected output for each test input? How are the first and last elements of the input to be handled, given that there are no elements on one side of them? Can you add an example where the larger-than criterion comes into play? – G. Grothendieck Feb 15 '20 at 12:55

3 Answers


We can use dplyr's lag and lead functions for that:

input_three = c(NA,NA,3,4,NA,6,NA,NA)

library(dplyr)
ifelse(is.na(input_three) & lead(input_three) > lag(input_three),
       (lag(input_three)  + lead(input_three))/ 2,
       input_three)

Returns:

[1] NA NA  3  4  5  6 NA NA

Edit

Explanation:

We use ifelse, which is the vectorized version of if, i.e. the test inside ifelse is evaluated for each element of the vector. First we test whether an element is NA and whether the following element is greater than the previous one. To get the previous and following elements we can use dplyr's lag and lead functions:

lag offsets a vector to the right (default is 1 step):

lag(1:5)

Returns:

[1] NA  1  2  3  4

lead offsets a vector to the left:

lead(1:5)

Returns:

[1]  2  3  4  5 NA

Now to the 'test' clause of ifelse:

is.na(input_three) & lead(input_three) > lag(input_three)

Which returns:

[1]    NA    NA FALSE FALSE  TRUE FALSE    NA    NA

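Then, where the test evaluates to TRUE, we return the sum of the previous and following elements divided by 2; otherwise we return the original element. Where the test itself is NA (the leading and trailing NAs here), ifelse returns NA, so those positions stay NA.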
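
If you would rather not depend on dplyr, the same idea can be sketched in base R. This is only a minimal sketch: the function name impute_neighbors and the helper vectors prev and nxt are made up for this illustration, and it assumes (as above) that only the immediate neighbours of an NA are considered.

```r
# Hypothetical helper: fill an NA with the mean of its direct neighbours,
# but only if both neighbours exist and the next value is larger than the previous one.
impute_neighbors <- function(x) {
  n <- length(x)
  prev <- c(NA, x[-n])   # same idea as lag(x): shift right by one
  nxt  <- c(x[-1], NA)   # same idea as lead(x): shift left by one
  # Checking the neighbours for NA explicitly keeps the index free of NAs
  fill <- is.na(x) & !is.na(prev) & !is.na(nxt) & nxt > prev
  x[fill] <- (prev[fill] + nxt[fill]) / 2
  x
}

impute_neighbors(c(1, NA, 3, 4, NA, 6, NA, NA))
# [1]  1  2  3  4  5  6 NA NA
```

Because the neighbours are tested for NA explicitly, fill only ever contains TRUE or FALSE, which makes it safe to use for subset assignment.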

dario

Here's an example using the imputeTS library. It handles runs of more than one NA, calculates the mean only if the next valid observation is greater than the last valid observation, and leaves NAs at the beginning and end untouched.

library(imputeTS)
library(dplyr)  # provides lag() and lead() for plain vectors
myimpute <- function(series) {
    # Find where each NA is
    nalocations <- is.na(series)
    # Find the previous and the next observation for each row
    last1 <- dplyr::lag(series)
    next1 <- dplyr::lead(series)
    # Carry forward the last and next observations over sequences of NA
    # Each row will then get a last and next that can be averaged
    cflast <- na_locf(last1, na_remaining = 'keep')
    cfnext <- na_locf(next1, option = 'nocb', na_remaining = 'keep')
    # Make a data frame 
    df <- data.frame(series, nalocations, last1, cflast, next1, cfnext)
    # Calculate the mean where there is currently a NA
    # making sure that the next is greater than the last
    df$mean <- ifelse(df$nalocations, ifelse(df$cflast < df$cfnext, (df$cflast+df$cfnext)/2, NA), NA)
    imputedseries <- ifelse(df$nalocations, ifelse(!is.na(df$mean), df$mean, NA), series)
    #list(df,  imputedseries) # comment this in and return it to see the intermediate data frame for debugging
    imputedseries
}
myimpute(c(NA,NA,3,4,NA,NA,6,NA,NA,8,NA,7,NA,NA,9,NA,11,NA,NA))

# [1] NA NA  3  4  5  5  6  7  7  8 NA  7  8  8  9 10 11 NA NA
Andrew Chisholm

There is also the na_ma function in the imputeTS package for imputing moving averages.

In your case this would be with the following settings:

na_ma(x, k = 1, weighting = "simple")

  • k = 1 (meaning 1 value before and 1 after the NA are taken into account)
  • weighting = "simple" (the mean of these two values is calculated)

This can be applied quite easily with basically one line of code:

library(imputeTS)
na_ma(yourData, k = 1, weighting = "simple") 

You could also take more values before and after the NA into account, e.g. k = 3. An interesting feature when you take more than one value on each side into account is the option to choose a different weighting: with weighting = "linear", the weights decrease in arithmetic progression (a Linear Weighted Moving Average), meaning that the further the values are from the NA, the less impact they have.
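
To make the difference between the two weightings concrete, here is a small base R sketch of the arithmetic for a single NA with k = 2. The numbers are invented, and the weights 1/2 for direct neighbours and 1/3 for values one step further away are my reading of the imputeTS documentation, so treat them as an assumption rather than as the package's exact internals.

```r
# Hypothetical window: two observed values on each side of one NA (k = 2)
window_values <- c(2, 4, 6, 10)
distance      <- c(2, 1, 1, 2)   # how far each value is from the NA

# weighting = "simple": every value in the window counts equally
simple_mean <- mean(window_values)

# weighting = "linear" (assumed scheme): weight 1 / (distance + 1),
# i.e. 1/2 for direct neighbours, 1/3 for the values one step further away
linear_weights <- 1 / (distance + 1)
linear_mean <- weighted.mean(window_values, linear_weights)

c(simple = simple_mean, linear = linear_mean)
```

With these numbers the simple mean is 5.5, while the linear weighting gives 5.4, because the distant value 10 counts less than the neighbouring 4 and 6.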

Steffen Moritz