Filling NA using linear regression in R

Question

I have a data with one time column and 2 variables.(example below)

df <- structure(list(time = c(15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 
                              25, 26), var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 
                              120.3, NA, 142.5), var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85, 
                              130.5, 147.15, 163.8, 180.45, 197.1, 213.75)), .Names = c("time", 
                              "var1", "var2"), row.names = c(NA, -12L), class = c("tbl_df", 
                              "tbl", "data.frame"))

The var1 has few NA and I want to fill the NA with linear regression between remaining values in var1 and var2.

Please Help!! And let me know if you need more information

What did you try? Run a linear model (your NAs will be omitted automatically), then `predict` on the rows with NA. Fill in your original data with predicted values. Where are you stuck? Are you familiar with `lm()`? — Gregor Thomas, Apr 03 '18 at 16:10
No I'm not familiar with lm(). And I couldn't find help regarding this in stackoverflow. Can you show me? — Kathiravan Meeran, Apr 03 '18 at 16:12

score 4 · Accepted Answer · answered Apr 03 '18 at 16:14

Here is an example using lm to predict values in R.

library(dplyr)

# Construct linear model based on non-NA pairs
df2 <- df %>% filter(!is.na(var1))

fit <- lm(var1 ~ var2, data = df2)

# See the result
summary(fit)

# Call:
#   lm(formula = var1 ~ var2, data = df2)
# 
# Residuals:
#   1          2          3          4          5          6 
# 8.627e-15 -2.388e-15  1.546e-16 -9.658e-15 -2.322e-15  5.587e-15 
# 
# Coefficients:
#   Estimate Std. Error   t value Pr(>|t|)    
# (Intercept) 2.321e-14  5.619e-15 4.130e+00   0.0145 *  
#   var2        6.667e-01  4.411e-17 1.511e+16   <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 7.246e-15 on 4 degrees of freedom
# Multiple R-squared:      1,   Adjusted R-squared:      1 
# F-statistic: 2.284e+32 on 1 and 4 DF,  p-value: < 2.2e-16
# 
# Warning message:
#   In summary.lm(fit) : essentially perfect fit: summary may be unreliable

# Use fit to predict the value
df3 <- df %>% 
  mutate(pred = predict(fit, .)) %>%
  # Replace NA with pred in var1
  mutate(var1 = ifelse(is.na(var1), pred, var1))

# See the result
df3 %>% as.data.frame()

#    time  var1   var2  pred
# 1    15  20.4  30.60  20.4
# 2    16  31.5  47.25  31.5
# 3    17  42.6  63.90  42.6
# 4    18  53.7  80.55  53.7
# 5    19  64.8  97.20  64.8
# 6    20  75.9 113.85  75.9
# 7    21  87.0 130.50  87.0
# 8    22  98.1 147.15  98.1
# 9    23 109.2 163.80 109.2
# 10   24 120.3 180.45 120.3
# 11   25 131.4 197.10 131.4
# 12   26 142.5 213.75 142.5

Dave2e · Answer 2 · 2018-04-03T16:34:15.723

2

Here is a one liner using the approx function from base R:

newvar1<-approx(df$time, df$var1, xout=df$time)

This function will apply a linear approximation between neighboring points was opposed to "www" answer which applies the linear approximation across all of the points. With this data, both solutions provide the same results since time and var1 has a perfect linear relationship, may not always be the case.
The xout option specifies the location where to estimate the new values, in this case I am passing the original time vector.

Related: See the spline function for a cubic approximation.

edited Apr 03 '18 at 16:34

answered Apr 03 '18 at 16:21

Dave2e

22,192
18
42
50

Good to learn the `approx` function. So I gave you an upvote. But if I understood the question correctly, the OP asked linear regression between `var1` and `var2`, while your solution is linear interpolation only in `var1`. Am I correct? – www Apr 03 '18 at 16:24
This is linear interpolation to fill in the NA values for `var1` assuming `time` is the independent value. This problem is not realistic since `time`, `var1` and `var2` are all 100% linear dependent. – Dave2e Apr 03 '18 at 16:31
@Dave2e Thanks for your solution. Thanks of introducing approx function. I'll use it in cases where I need to fill my variable using time. Thanks again – Kathiravan Meeran Apr 03 '18 at 16:34

BGM · Answer 3 · 2020-11-23T18:14:52.200

I realize this is an old question but this might be a useful brute-force technique

generate your linear model

fit <- lm(var1 ~ var2, data = df)

Save the coefficients into an object using coef()

fit.c <- coef(fit)
fit.c

Use those coefficient to generate a predicted value as a new variable. The bracketed numbers indicate the position of the coefficient in the vector fit.c. fit.c[1] is the intercept.

df$pred <- fit.c[1] + fit.c[2]*df$var2

You may at this time replace NA values in the original variable

df$var1[is.na(df$var1)] <- df$pred

However my instincts say to not overwrite values in your original variable and instead use pred for whatever purpose you planned for var1.

Filling NA using linear regression in R

3 Answers3

Linked