How to bootstrap a linear regression and estimate confidence intervals in R?

Question

I have data-set where the bootstraps were performed such that only values within the two main factors replicate/ level were replaced.

replicate level high.density low.density
    1     low    14          36
    1     low    54          31
    1     mid    82          10
    1     mid    24          NA
    2     low    12          28
    2     low    11          45
    2     mid    12          17
    2     mid    NA          24
    2      up    40          10
    2      up    NA           5
    2      up    20           2

##DATA FRAME

 df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10, 
    NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
    
    df$replicate <- as.factor(as.numeric(df$replicate))
    df$level <- as.factor(as.character(df$level))

##Creating the data-set needed for boot. Only values with-in the unique replicate/ level were allowed to shuffle (Credits: Dion)

df_shuffle <- function(DF) {
  my_split <- split(DF, f = ~ DF$replicate + DF$level)
  shuffle <- lapply(my_split, function(x) {
    nrX <- nrow(x)
    cbind(x[, c('replicate', 'level')],
          high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
          low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
  })
  DF_new <- do.call(rbind, shuffle)
  rownames(DF_new) <- NULL
  return(DF_new)
}

B <- 1000
df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
df_list <- lapply(df_list, function(x) x[complete.cases(x), ]) #choose complete cases

Now I want to bootstrap over these observations to estimate the coefficient, p-value and confidence interval. I am trying to replicate the boot function like in this example and draw the correct confidence interval like in this example (I only need the overall bootstrapped line and confidence interval)

#A sample code for the plot

df_boot <-  rbindlist(df_list, idcol = 'count')

ggplot(aes(x = low.density, y = high.density), data = df_boot) +  
  stat_smooth(aes(group = factor(count)), geom = "line", method = "lm", se = FALSE, color = "red", alpha=0.02) +
  stat_smooth(geom = "line", method = "lm", se = FALSE, color = "black", linetype = "longdash") +
  theme(panel.background = element_blank()) + theme(legend.position="none")

Dion Groothof · Answer 1 · 2022-01-28T22:36:07.450

We may conduct a non-parametric or parametric bootstrap of the postulated linear model. The important difference between a non-parametric and a parametric bootstrap is that in the non-parametric case, we are going to repeatedly sample from our original dataframe df, whereas in the parametric case, we simulate new data from our original model.

We postulate the following linear model:

high.density_i = b₀ + b₁low.density + e_i, e_i ~ N(0, sigma²)

It is important to first listwise delete rows containing missing observations and then run the bootstrap procedure (even though you'd better multiply impute missing observations in independents, because you will obtain biased inference if the missing data mechanism is not of the so-called MCAR type). In the post with the function df_shuffle() to which you refer in your question listwise deletion was performed after resampling the data. Performing listwise deletion before the resampling ensures that each bootstrap sample has the same number of rows as df. The is a prerequisite for the bootstrap to work and thus to be able to make valid inference based on the bootstrapping procedure.

The function boot_lm() allows the user to either perform a non-parametric or parametric bootstrap. It consists of the following arguments:

original_model: a character string specifying the postulated linear model.
original_data: a character string specifying the dataframe that was used to fit the postulated linear model.
type: a character string specifying whether a parametric (param) or non-parametric (ordinary) bootstrap should be performed.
B: the number of bootstrap samples to be taken.
seed: an integer to fix the random number generator.

# listwise deletion
df <- df[complete.cases(df), ]

# linear model to be bootstrapped
fm0 <- lm(high.density ~ low.density, data = df)

boot_lm <- function(original_data, original_model,
                    type = c('ordinary', 'param'),
                    B = 1000L, seed = 1) {
  set.seed(seed)
  betas_original_model <- coef(original_model)
  len_coef <- length(betas_original_model)
  mat <- matrix(rep(0L, B * len_coef), ncol = len_coef)
  if (type %in% 'ordinary') {
    n_rows <- length(residuals(original_model))
    for (i in seq_len(B)) {
      boot_dat <- original_data[sample(seq_len(n_rows), replace = TRUE), ]
      mat[i, ] <- coef(lm(terms(original_model), data = boot_dat))
    }
  }
  if (type %in% 'param') {
    X <- model.matrix(delete.response(terms(original_model)),
                      data = original_data)[, -1L]
    for (i in seq_len(B)) {
      mat[i, ] <- coef(lm(unlist(simulate(original_model)) ~ X,
                          data = original_data))
    }
  }
  confints <- matrix(rep(0L, 2L * len_coef), ncol = 2L)
  pvals <- numeric(len_coef)
  for (i in seq_len(len_coef)) {
    pvals[i] <- mean(abs(mat[, i] - mean(mat[, i])) > abs(betas_original_model[i]))
    confints[i, ] <- quantile(mat[, i], c(.025, 0.975))
  }
  names(pvals) <- names(betas_original_model)
  out <- data.frame(estimate = betas_original_model,
                    'lwr' = confints[, 1], 'upr' = confints[, 2],
                    p_value = pvals)
  return(out)
}

Output: your data

# non-parametric bootstrap
ordinary <- boot_lm(original_data = df, original_model = fm0,
                    type = 'ordinary', B = 1e4)

> ordinary
              estimate       lwr        upr p_value
(Intercept) 45.1522806 16.290080 88.6969733  0.0220
low.density -0.6492639 -2.055204  0.5368766  0.2792
# --------------------------------------------------------
# parametric bootstrap
param <- boot_lm(original_data = df, original_model = fm0,
                 type = 'param', B = 1e4)

> param
              estimate       lwr        upr p_value
(Intercept) 45.1522806 10.472075 79.1197394  0.0103
low.density -0.6492639 -1.971258  0.6381189  0.3245

Output: mtcars

# linear model to be bootstrapped
fm1 <- lm(mpg ~ wt + cyl + qsec, data = mtcars)

ordinary <- boot_lm(original_data = mtcars, original_model = fm1,
                    type = 'ordinary', B = 1e4)

> ordinary
              estimate        lwr        upr p_value
(Intercept) 29.4290521 13.8283579 41.2637258  0.0009
wt          -3.8616401 -6.6867159 -2.0884969  0.0084
cyl         -0.9277487 -1.9447741  0.4831599  0.1174
qsec         0.4944817 -0.1141793  1.3369213  0.1825

Thank you! My concern is that the first step here is not capturing the essence of my problem. When we do compelte.cases it removes the rows containing NAs. I don't want to lose that information. That is why in the previous question I wanted to create a data-set via bootstrap such that those values get shuffled within replicate/ level. For instance, for rep/ level: 1/mid one pairing in the data set would be 82 - 10 and in the other 24 - 10. In the method above I would never get te 24-10 combination. In the end those datasets should also contain 8 complete rows. — Rspacer, Jan 21 '22 at 14:19
Yes, I know the context well. But the stocastic nature of the resampling causes, just by change, the resampled data to sometimes have >2 NAs . Have a look yourself: `out <- do.call(rbind, lapply(df_list, \(x) apply(x[, c(3,4)], 2, \(i) sum(is.na(i))))); out`. We may also see how many times the resampled data actually contain >2 NAs per column: `apply(out, 2, \(x) table(x>2))`. This yields zero times for `low.density`, but 324 times for `high.density`. This makes sense, because `high.density` contained 2 NAs and `low.density` only one in your original data.frame `df`. — Dion Groothof, Jan 21 '22 at 14:38
Note that these numbers of times might fluctuate a bit, because I did not fix the random number generator (i.e. `set.seed()`). — Dion Groothof, Jan 21 '22 at 14:38

score 1 · Answer 2 · answered Jan 21 '22 at 19:56

It sounds like one of the big asks is to be able to plot the bootstrapped results. Here is one option:

First, make the data

df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10, 
                                                                                                                                                                                                                                        NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))

df$replicate <- as.factor(as.numeric(df$replicate))
df$level <- as.factor(as.character(df$level))

Next, make some hypothetical data that moves the variable of interest low.density across its range. Here we are doing it for all possible combinations of replicate and level, but you could also just pick one value for each.

hyp <- expand.grid(replicate = as.factor(c(1,2)), 
                   level=factor(1:3, labels=c("low", "mid", "up")), 
                   low.density = seq(min(df$low.density, na.rm=TRUE), 
                                     max(df$low.density, na.rm=TRUE), 
                                     length=25))

Then, we do the bootstrapping. In the function below, we draw the data, estimate the model and then generate predictions. If the model or predictions fail, then that particular model is thrown out and another is drawn until you get 1000 valid draws.

res <- NULL
i <- 1
while(i <= 1000){
  tmp <- df_shuffle(df)
  mod <- try(lm(high.density ~ low.density + replicate + level, data=tmp))
  if(!inherits(mod, "try-error")){
    pred <- try(predict(mod, newdata=hyp))
    if(!inherits(pred, "try-error")){
      res <- cbind(res, pred )
      i <- i+1
    }
  }
}

Estimate the original model to get the predicted values

orig <- lm(high.density ~ low.density + level + replicate, data=df)
hyp$fit <- predict(orig, newdata=hyp)

Calculate the quantile confidence intervals for each bootstrapped prediction and add them to the dataset.

cis <- t(apply(res, 1, quantile, c(.025,.975)))
hyp$lwr <- cis[,1]
hyp$upr <- cis[,2]

Finally, make the plot.

ggplot(hyp, aes(x=low.density, y=fit, 
                ymin = lwr, ymax=upr)) + 
  geom_ribbon(colour="transparent", alpha=.25) + 
  geom_line() + 
  facet_grid(replicate ~ level) + 
  theme_bw() + 
  theme(panel.grid=element_blank())

To get smoother bounds, try more bootstrap replicates.

Cool! But this approach does not solve the problem of having different numbers of cases across bootstrap samples, since `na.action` defaults to `na.omit`. The point estimates of `mod` are based on the output of calls to `df_shuffle()`. Out of 10k bootstrap samples, only 183 did not contain _any_ missing observations. — Dion Groothof, Jan 21 '22 at 20:45
It actually isn’t failing if there are any missing observations. It fails if there are no valid observations for one of the categories of replicate or level. I am not sure what behavior you would want otherwise. — DaveArmstrong, Jan 21 '22 at 20:47
Yes, I know that it isn't going to fail in the presence of missing observations, as it performs a listwise deletion by default. But that is I think where the catch is. Suppose that in the i-th iteration a call to the `summary()` method would state that `(3 observations deleted due to missingness)`. Then, in the i +1-th iteration, you would observe that `(2 observations deleted due to missingness)`. So, every fit will be based on different numbers of obervations, which is likely to yield biased inference (as this approach is only valid when the missing data mechanism is of the MCAR type). — Dion Groothof, Jan 21 '22 at 21:07
True, but that is not so different from the assumptions driving the normal theory result. Is your proposed alternative then to only use the bootstrapped datasets that have no missing observations? — DaveArmstrong, Jan 21 '22 at 21:21
In a sense, yes. But then the bootstrapped datasets will have more rows (N=11) than the original dataset `df` (n=8), at least in terms of the number of rows that are used in a call to `lm()` to fit `mod`. I think it would be more approriate to first listwise delete rows that contain missing observations, under the assumption that the data are MCAR (which likely does not hold true) and then perform the bootstrap. — Dion Groothof, Jan 21 '22 at 21:48

How to bootstrap a linear regression and estimate confidence intervals in R?

2 Answers2

Linked