
I've looked at some web pages, but their solutions don't meet my needs.

I want to write a function that could do this:

Say there is a vector a.

a = c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)

First, find the values immediately before and after each run of one or more consecutive NAs. Here the two segments are 137862, NA, NA, NA, 178337 and 178337, NA, NA, NA, NA, NA, 295530.

Second, calculate the slope for each segment and use it to replace the NAs.

# 137862, NA, NA, NA, 178337
slope_1 = (178337 - 137862)/4

137862 + slope_1*1 # 1st NA is replaced with 147980.8
137862 + slope_1*2 # 2nd NA is replaced with 158099.5
137862 + slope_1*3 # 3rd NA is replaced with 168218.2

# 178337, NA, NA, NA, NA, NA, 295530

slope_2 = (295530 - 178337)/6

178337 + slope_2*1 # 4th NA is replaced with 197869.2
178337 + slope_2*2 # 5th NA is replaced with 217401.3
178337 + slope_2*3 # 6th NA is replaced with 236933.5
178337 + slope_2*4 # 7th NA is replaced with 256465.7
178337 + slope_2*5 # 8th NA is replaced with 275997.8

Finally, the expected vector should be this:

a_without_NA = c(100000, 137862, 147980.8, 158099.5, 168218.2, 178337, 197869.2, 217401.3, 
                 236933.5, 256465.7, 275997.8, 295530)
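
As a quick check of the arithmetic above, the same fill values can be sketched with base R's seq(): a gap of k NAs between two known values corresponds to k + 2 equally spaced points, of which the interior ones are the replacements. The names fill_1 and fill_2 are only illustrative and not part of the original question.

```r
# Gap of 3 NAs between 137862 and 178337: 5 equally spaced points,
# keep the 3 interior ones.
fill_1 <- seq(137862, 178337, length.out = 5)[2:4]

# Gap of 5 NAs between 178337 and 295530: 7 equally spaced points,
# keep the 5 interior ones.
fill_2 <- seq(178337, 295530, length.out = 7)[2:6]

fill_1
fill_2
```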

If a single NA or a run of consecutive NAs is at the beginning or the end of the vector, it should be kept as NA.

# NA at beginning
b = c(NA, NA, 1, 3, NA, 5, 7)

# 3, NA, 5
slope_1 = (5-3)/2
3 + slope_1*1 # 3rd NA is replaced with 4
b_without_NA = c(NA, NA, 1, 3, 4, 5, 7)

# NA at end
c = c(1, 3, NA, 5, 7, NA, NA)

# 3, NA, 5
slope_1 = (5-3)/2
3 + slope_1*1 # 1st NA is replaced with 4
c_without_NA = c(1, 3, 4, 5, 7, NA, NA)

Note: in my real data, the vector is strictly increasing (vector[n + 1] > vector[n]).

I know the principle, but I don't know how to write a user-defined function that implements it.

Any help will be highly appreciated!

ThomasIsCoding
zhiwei li

4 Answers


zoo's na.approx can help. Setting na.rm = FALSE keeps leading and trailing NAs in place:

a = c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)
zoo::na.approx(a, na.rm = FALSE)

# [1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3
# [9] 236933.5 256465.7 275997.8 295530.0

b = c(NA, NA, 1, 3, NA, 5, 7)

zoo::na.approx(b, na.rm = FALSE)
#[1] NA NA  1  3  4  5  7

c = c(1, 3, NA, 5, 7, NA, NA)
zoo::na.approx(c, na.rm = FALSE)
#[1]  1  3  4  5  7 NA NA
Ronak Shah

Here is a base R option using approx. With the default rule = 1, positions outside the range of the known values stay NA:

approx(seq_along(a)[!is.na(a)], a[!is.na(a)], seq_along(a))$y
 [1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3
 [9] 236933.5 256465.7 275997.8 295530.0
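
A minimal sketch of how the approx call could be wrapped into a reusable function. The name interp_na and the early-return guard are assumptions added for illustration, not part of the answer; the guard is there because approx() needs at least two known points to interpolate.

```r
# Hypothetical helper (not from the answer): linear interpolation of
# interior NAs, leaving leading and trailing NAs untouched.
interp_na <- function(x) {
  ok <- !is.na(x)
  # With fewer than two known values there is nothing to interpolate.
  if (sum(ok) < 2) return(x)
  # rule = 1 (the default) returns NA outside the range of known points,
  # so NAs at the beginning or end are kept.
  approx(which(ok), x[ok], xout = seq_along(x), rule = 1)$y
}

interp_na(c(NA, NA, 1, 3, NA, 5, 7))
interp_na(c(1, 3, NA, 5, 7, NA, NA))
```

Returning the input unchanged when fewer than two values are known is one possible choice; raising an error would be equally defensible.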
ThomasIsCoding

Here is one approach with data.table. Get the run-length ID (rleid) of consecutive NAs in 'a' as 'grp', create two temporary columns 'a1' and 'a2' as the lag and lead of 'a', compute 'tmp' from the slope calculation grouped by 'grp', and finally fcoalesce the original 'a' with 'tmp':

library(data.table)
data.table(a)[, grp := rleid(is.na(a))][, 
  c('a1', 'a2') := .(shift(a), shift(a, type = 'lead'))][, 
   tmp := first(a1) + seq_len(.N) *( (last(a2) - first(a1))/(.N + 1)), 
      .(grp)][, fcoalesce(a, tmp)]
#[1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 
#[7] 197869.2 217401.3 236933.5 256465.7 275997.8 295530.0
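
To make the grouping step easier to follow, here is a small base R sketch of the same run-length idea using rle(). It only illustrates what the 'grp' column holds and how long each NA run is; the variable names runs and na_run_lengths are assumptions for this illustration.

```r
a <- c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)

# Run-length encoding of the NA pattern: alternating runs of FALSE and TRUE.
runs <- rle(is.na(a))

# One id per run, repeated over the run's length (the role 'grp' plays above).
grp <- rep(seq_along(runs$lengths), runs$lengths)
grp
# 1 1 2 2 2 3 4 4 4 4 4 5

# Lengths of the NA runs only; each slope divides by (run length + 1).
na_run_lengths <- runs$lengths[runs$values]
na_run_lengths
# 3 5
```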
akrun

For this purpose I defined a custom function:

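my_replace_na <- function(x) {
  non <- which(!is.na(x))          # indices of the non-NA values

  # seq_len() gives an empty loop when fewer than two non-NA values exist
  for (i in seq_len(max(length(non) - 1, 0))) {
    if (non[i + 1] - non[i] > 1) { # at least one NA between two known values
      hi <- non[i + 1]
      lo <- non[i]
      slope <- (x[hi] - x[lo]) / (hi - lo)

      for (j in 1:(hi - lo - 1)) { # fill each NA inside the gap
        x[lo + j] <- x[lo] + slope * j
      }
    }
  }
  x
}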

a <- c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)
my_replace_na(a)

 [1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3 236933.5 256465.7
[11] 275997.8 295530.0

# NA at beginning
d <- c(NA, NA, 1, 3, NA, 5, 7)
my_replace_na(d)

[1] NA NA  1  3  4  5  7

# NA at end
e <- c(1, 3, NA, 5, 7, NA, NA)
my_replace_na(e)

[1]  1  3  4  5  7 NA NA

Anoushiravan R