How to replace NA separately with linear model in R

Question

I've looked at some web pages, but their results don't meet my needs.

I want to write a function that could do this:

Say there is a vector a.

a = c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)

First, find the values immediately before and after each run of one or more consecutive NAs. In this case the segments are 137862, NA, NA, NA, 178337 and 178337, NA, NA, NA, NA, NA, 295530.

Second, calculate the slope within each segment and use it to replace the NAs.

# 137862, NA, NA, NA, 178337
slope_1 = (178337 - 137862)/4

137862 + slope_1*1 # 1st NA replace with 147980.8
137862 + slope_1*2 # 2nd NA replace with 158099.5
137862 + slope_1*3 # 3rd NA replace with 168218.2

# 178337, NA, NA, NA, NA, NA, 295530

slope_2 = (295530 - 178337)/6

178337 + slope_2*1 # 4th NA replace with 197869.2
178337 + slope_2*2 # 5th NA replace with 217401.3
178337 + slope_2*3 # 6th NA replace with 236933.5
178337 + slope_2*4 # 7th NA replace with 256465.7
178337 + slope_2*5 # 8th NA replace with 275997.8

Finally, the expected vector should be this:

a_without_NA = c(100000, 137862, 147980.8, 158099.5, 168218.2, 178337, 197869.2, 217401.3, 
                 236933.5, 256465.7, 275997.8, 295530)
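As a quick check of the arithmetic for the first segment, here is a minimal base R sketch. The names left, right, n_na and filled are only illustrative and do not come from the question.

```r
# Values taken from the first segment: 137862, NA, NA, NA, 178337
left  <- 137862
right <- 178337
n_na  <- 3                                  # number of NAs between the known values

slope  <- (right - left) / (n_na + 1)       # same as slope_1 above
filled <- left + slope * seq_len(n_na)      # one value per NA
filled

# The same interior points via seq(): two endpoints plus n_na points between them
seq(left, right, length.out = n_na + 2)[-c(1, n_na + 2)]
```

Both expressions give the three interpolated values for that segment.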

If a single NA or a run of consecutive NAs is at the beginning or the end of the vector, it should be kept as NA.

# NA at beginning
b = c(NA, NA, 1, 3, NA, 5, 7)

# 3, NA, 5
slope_1 = (5-3)/2
3 + slope_1*1 # 3rd NA replace with 4
b_without_NA = c(NA, NA, 1, 3, 4, 5, 7)

# NA at ending
c = c(1, 3, NA, 5, 7, NA, NA)

# 3, NA, 5
slope_1 = (5-3)/2
3 + slope_1*1 # 1st NA replace with 4
c_without_NA = c(1, 3, 4, 5, 7, NA, NA)

Note: in my real data, the vector is strictly increasing (vector[n + 1] > vector[n]).

I understand the principle, but I don't know how to write a user-defined function that implements it.

Any help will be highly appreciated!

@akrun, sorry, it was a mistake; I have updated my code. – zhiwei li, May 04 '21 at 02:49

score 5 · Answer 1 · answered May 04 '21 at 05:09

zoo's na.approx can help:

a = c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)
zoo::na.approx(a, na.rm = FALSE)

# [1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3
# [9] 236933.5 256465.7 275997.8 295530.0

b = c(NA, NA, 1, 3, NA, 5, 7)

zoo::na.approx(b, na.rm = FALSE)
#[1] NA NA  1  3  4  5  7

c = c(1, 3, NA, 5, 7, NA, NA)
zoo::na.approx(c, na.rm = FALSE)
#[1]  1  3  4  5  7 NA NA
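To show why na.rm = FALSE matters here, this is a small sketch (assuming the zoo package is installed) that compares three calls on the same vector. The rule = 2 call is included only for contrast; it is not what the question asks for.

```r
library(zoo)

b <- c(NA, NA, 1, 3, NA, 5, 7)

# Default na.rm = TRUE: leading/trailing NAs are dropped, so the result is shorter
na.approx(b)
# [1] 1 3 4 5 7

# na.rm = FALSE keeps the original length and leaves the edge NAs in place
na.approx(b, na.rm = FALSE)
# [1] NA NA  1  3  4  5  7

# rule = 2 is passed on to approx() and fills the edges with the nearest known value
na.approx(b, rule = 2)
# [1] 1 1 1 3 4 5 7
```

Keeping the original length is usually what you want when the result is assigned back into a data frame column.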

Ronak Shah

I got a new problem. I'd appreciate it if you could take a look at it. https://stackoverflow.com/questions/67427949/how-to-use-none-standard-evaluation-in-r – zhiwei li May 07 '21 at 01:42
Sorry to bother you, I have a new problem, I would appreciate it if you have time to help me look at it. (https://stackoverflow.com/questions/68141082/matching-controls-to-cases-using-multiple-conditions-in-r) – zhiwei li Jun 26 '21 at 09:47

score 4 · Answer 2 · answered May 04 '21 at 08:00

Here is a base R option using approx

approx(seq_along(a)[!is.na(a)], a[!is.na(a)], seq_along(a))$y
# [1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3
# [9] 236933.5 256465.7 275997.8 295530.0
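Since the question asks for a reusable function, the approx call can be wrapped as in the sketch below. The name fill_na_linear is hypothetical, and the early return for vectors with fewer than two known values is an added assumption, because approx needs at least two points to interpolate.

```r
fill_na_linear <- function(x) {
  known <- !is.na(x)
  # approx() needs at least two known points; otherwise return x unchanged
  if (sum(known) < 2) return(x)
  # rule = 1 (the default) returns NA outside the range of known points,
  # so leading and trailing NAs stay NA
  approx(which(known), x[known], xout = seq_along(x), rule = 1)$y
}

fill_na_linear(c(NA, NA, 1, 3, NA, 5, 7))
# [1] NA NA  1  3  4  5  7
fill_na_linear(c(1, 3, NA, 5, 7, NA, NA))
# [1]  1  3  4  5  7 NA NA
```

Passing rule = 1 explicitly only documents the default; it is what keeps the NAs at either end untouched, as the question requires.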

ThomasIsCoding

akrun · Accepted Answer · 2021-05-04T02:53:11.203

Here is one approach with data.table. Get the run-length ID (rleid) of the NA / non-NA runs in 'a' as 'grp'; create two temporary columns, 'a1' and 'a2', as the lag and lead of 'a'; compute 'tmp' within each 'grp' as the value before the run plus the step number times the per-step slope; and finally fcoalesce the original 'a' with 'tmp', so that only the NA positions take the interpolated values.

library(data.table)
data.table(a)[, grp := rleid(is.na(a))][, 
  c('a1', 'a2') := .(shift(a), shift(a, type = 'lead'))][, 
   tmp := first(a1) + seq_len(.N) *( (last(a2) - first(a1))/(.N + 1)), 
      .(grp)][, fcoalesce(a, tmp)]
#[1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 
#[7] 197869.2 217401.3 236933.5 256465.7 275997.8 295530.0
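To make the grouping step easier to follow, here is a base R sketch of the same idea without data.table. It uses rle() in place of rleid and works through only the first NA run; the names runs, idx, left and right are illustrative.

```r
a <- c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)

# Run-length ids: consecutive NA / non-NA stretches get increasing group numbers
runs <- rle(is.na(a))
grp  <- rep(seq_along(runs$lengths), runs$lengths)
grp
# [1] 1 1 2 2 2 3 4 4 4 4 4 5

# For the first NA run (group 2): value just before, value just after, run length
idx   <- which(grp == 2)
left  <- a[min(idx) - 1]                    # 137862
right <- a[max(idx) + 1]                    # 178337
left + seq_along(idx) * (right - left) / (length(idx) + 1)
# [1] 147980.8 158099.5 168218.2
```

This mirrors first(a1) + seq_len(.N) * (last(a2) - first(a1)) / (.N + 1) in the data.table chain, evaluated for a single group.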

edited May 04 '21 at 02:53

answered May 04 '21 at 02:47

akrun

I got a new problem. I'd appreciate it if you could take a look at it. https://stackoverflow.com/questions/67427949/how-to-use-none-standard-evaluation-in-r – zhiwei li May 07 '21 at 01:44
Sorry to bother you, I have a new problem, I would appreciate it if you have time to help me look at it. (https://stackoverflow.com/questions/68141082/matching-controls-to-cases-using-multiple-conditions-in-r) – zhiwei li Jun 26 '21 at 09:47

Anoushiravan R · Answer 4 · 2021-08-20T20:11:55.020

For this purpose I defined a custom function:

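my_replace_na <- function(x) {
  non <- which(!is.na(x))          # indices of the non-NA values

  # seq_len() gives an empty loop when there are fewer than two non-NA values
  for (i in seq_len(max(length(non) - 1, 0))) {
    lo <- non[i]                   # last known position before a possible gap
    hi <- non[i + 1]               # first known position after it

    if (hi - lo > 1) {             # at least one NA between lo and hi
      slope <- (x[hi] - x[lo]) / (hi - lo)
      for (j in seq_len(hi - lo - 1)) {
        x[lo + j] <- x[lo] + slope * j
      }
    }
  }
  x
}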

a <- c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)
my_replace_na(a)

 [1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3 236933.5 256465.7
[11] 275997.8 295530.0

# NA at beginning
d <- c(NA, NA, 1, 3, NA, 5, 7)
my_replace_na(d)

[1] NA NA  1  3  4  5  7

# NA at ending
e <- c(1, 3, NA, 5, 7, NA, NA)
my_replace_na(e)

[1]  1  3  4  5  7 NA NA
