1

I have a dataframe in which a column (x) has some missing values and I would like to create a new column (y) replacing these NAs with the nearest value already existing. Also I would like that if the distance is the same both ways, I get the mean of both.

Here is an example of what I mean:

data <- data.frame(x = c(2, 3, 3, NA, NA, 4, NA, 3, 2, NA, NA, NA, 4))

The dataframe I would like to obtain would be that:

x y
2 2
3 3
3 3
NA 3
NA 4
4 4
NA 3.5
3 3
2 2
NA 2
NA 3
NA 4
4 4

Thanks in advance

1 Answers1

2

Your case is slightly different from the usual cases, where interpolation is needed from the first NA. Here, interpolation is necessary only when the NA sequence is 1, 3, or more, and for the first and last value in a sequence of NA, you need the closest non-NA.

You can use lead and lag to do so, and then apply interpolation with zoo::na.approx.

library(dplyr)
library(zoo)
data %>% 
  mutate(y = case_when(is.na(x) & !is.na(lag(x)) & is.na(lead(x)) ~ lag(x),
                       is.na(x) & is.na(lag(x)) & !is.na(lead(x)) ~ lead(x),
                       TRUE ~ x) %>% 
           na.approx(.))

output

    x   y
1   2 2.0
2   3 3.0
3   3 3.0
4  NA 3.0
5  NA 4.0
6   4 4.0
7  NA 3.5
8   3 3.0
9   2 2.0
10 NA 2.0
11 NA 3.0
12 NA 4.0
13  4 4.0
Maël
  • 45,206
  • 3
  • 29
  • 67
  • will this work if there were at least 4NAs? I do not think so. – Onyambu Oct 20 '22 at 08:22
  • 1
    It should be working. The first and last value of the sequence will be the lag and lead, and the 2 middle values will be interpolated. Test with `data.frame(x = c(2, 3, 3, NA, NA, NA, NA, 4, 2, NA, NA, NA, 4))` – Maël Oct 20 '22 at 08:29
  • 1
    yes, and am curious as to whether OP wants the values approximated or continue replacement. eg NA,NA,NA,NA to be 3,3,4,4 – Onyambu Oct 20 '22 at 08:40