Loop over columns in a data.frame, extract model beta and SE into a new data.frame

Question

I raised a similar question previously which was adequately answered Nested loop with simple linear regression (Thank you!)

This time, I am seeking a method to loop over columns in a data frame and save the output in a new data frame where each row of the new output data frame represents one iteration of the loop (namely a column from the original data frame we are looping over).

Here is my attempt so far:

library(tidyverse)
set.seed(42)
n <- 1000
dat <- data.frame(id=1:n,
                  q=runif(n, min=45, max=85),
                  r=runif(n, min=2.4, max=6.0),
                  s=runif(n, min=24, max=60),
                  t=runif(n, min=0.28, max=1.73))
vd <- runif(n, min=15, max=125)

my_models <- list()
x <- 1
for (i in 1:colnames(dat)){ 
    model <- lm(paste(dat[[i]], "~", vd, "-1"))
    my_models[[x]]<-data.frame(ModelNo=i,coefficients(summary(model)))
    x <- x+1
  } 
output <- do.call(rbind,my_models)

reg.tbl <- output %>% 
  mutate(beta=(round(exp(Estimate), digits=3)),
         lower=(round(exp(Estimate-1.96*Std..Error), digits=3)),
         upper=(round(exp(Estimate+1.96*Std..Error), digits=3))) %>% 
  select(-Estimate, -Std..Error, -t.value, -Pr...t..) %>% 
  mutate(coef=paste(beta, " ", "(", lower, " ", "to", " ", upper, ")")) %>% 
  mutate(outcome=if_else(ModelNo==1, "MI", "Stroke")) %>% 
  select(-ModelNo, -beta, -lower, -upper)

The reason I have not applied the previous solution to my current data is that I have over 40 variables to run regressions on, and saving each as an object and then into a 'predictorlist' as per my previous solution is time consuming and defeats the purpose of trying to loop over something to automate the analysis. The very reason I switched over to R ~18 months ago.

Any assistance is, naturally, greatly appreciated.

I don't understand why you want the final `reg.tbl` coefs and CI's as character strings, wouldn't it make more sense to have a table of numeric vectors? — Rui Barradas, Mar 07 '22 at 08:27
@RuiBarradas thanks for providing a response. Agree, it is more useful to save as numeric vectors. Can this be done? I guess these data represent output I will paste into a manuscript. — Sandro, Mar 07 '22 at 10:56

Rui Barradas · Answer 1 · 2022-03-07T12:31:39.923

I don't understand why you want the final result coefficients and CI's as character strings but the code below produces them like that.

I have changed the for loop, the question's didn't work;
the regression formulas are created with reformulate, not with paste;
in the pipe, the numbers are not rounded, sprintf takes care of that.

suppressPackageStartupMessages(library(tidyverse))
set.seed(42)
n <- 1000
dat <- data.frame(id=1:n,
                  q=runif(n, min=45, max=85),
                  r=runif(n, min=2.4, max=6.0),
                  s=runif(n, min=24, max=60),
                  t=runif(n, min=0.28, max=1.73))
vd <- runif(n, min=15, max=125)

my_models <- vector("list", length = ncol(dat) - 1L)
for(i in seq_along(names(dat))[-1]){
  resp <- names(dat)[i]
  fmla <- reformulate(termlabels = c(-1, "vd"), response = resp)
  model <- lm(fmla, data = dat)
  smry <- summary(model)
  my_models[[i - 1L]] <- data.frame(ModelNo = i - 1L, coefficients(smry), check.names = FALSE)
} 
output <- do.call(rbind, my_models)
row.names(output) <- NULL

output %>% 
  mutate(beta = exp(Estimate),
         lower = exp(Estimate - 1.96*`Std. Error`),
         upper = exp(Estimate + 1.96*`Std. Error`)) %>% 
  select(-Estimate, -`Std. Error`, -`t value`, -`Pr(>|t|)`) %>% 
  mutate(coef = sprintf("%.03g (%.03g to %.03g)", beta, lower, upper)) %>% 
  mutate(outcome = if_else(ModelNo == 1, "MI", "Stroke")) %>% 
  select(-ModelNo, -beta, -lower, -upper)
#>                  coef outcome
#> 1 2.13 (2.08 to 2.18)      MI
#> 2 1.05 (1.05 to 1.05)  Stroke
#> 3 1.64 (1.61 to 1.66)  Stroke
#> 4 1.01 (1.01 to 1.01)  Stroke

^{Created on 2022-03-07 by the reprex package (v2.0.1)}

Edit

Here is another way. At first it seems more complicated but it simplifies the main loop, making it a simple purrr::map call.

Write a regression function reg_fun taking care of everything, the regression and the computation of the confidence intervals lower and upper bounds.
Then have it return an S3 class object sub-classing class "data.frame". I have called this custom class "Sandro", feel free to change this.
The custom class serves a purpose, to have its own print method. Like this the regression's return value will still be numbers and can be used extracting them with the standard extraction operators, but they will print in the desired format.

Then call purrr::map_dfr in a much simpler pipe.

suppressPackageStartupMessages(library(tidyverse))

set.seed(42)
n <- 1000
dat <- data.frame(id=1:n,
                  q=runif(n, min=45, max=85),
                  r=runif(n, min=2.4, max=6.0),
                  s=runif(n, min=24, max=60),
                  t=runif(n, min=0.28, max=1.73))
vd <- runif(n, min=15, max=125)

reg_fun <- function(y, x, data, conf = 0.95){
  alpha <- qnorm(1 - (1 - conf)/2)
  fit <- lm(y ~ 0 + x, data = data)
  smry <- summary(fit)
  estimate <- coef(smry)[, 1]
  se <- coef(smry)[, 2]
  out <- data.frame(
    beta = exp(estimate),
    lower = exp(estimate - alpha * se),
    upper = exp(estimate + alpha * se)
  )
  class(out) <- c("Sandro", class(out))
  out
}

# custom print method
print.Sandro <- function(x, digits = 3){
  outcome <- rep("Stroke", nrow(x))
  outcome[1] <- "MI"
  fmt <- paste0("%.0", digits, "g")
  fmt <- paste0(fmt, " (", fmt, " to ", fmt, ")")
  out <- data.frame(
    coef = sprintf(fmt, x[["beta"]], x[["lower"]], x[["upper"]]),
    outcome = outcome
  )
  print.data.frame(out)
}

dat %>%
  select(-id) %>%
  map_dfr(reg_fun, x = vd, data = dat)
#>                  coef outcome
#> 1 2.13 (2.08 to 2.18)      MI
#> 2 1.05 (1.05 to 1.05)  Stroke
#> 3 1.64 (1.61 to 1.66)  Stroke
#> 4 1.01 (1.01 to 1.01)  Stroke

^{Created on 2022-03-07 by the reprex package (v2.0.1)}

Edit 2

To have the responses also print, change the print method and the pipe to the following.

print.Sandro <- function(x, digits = 3){
  fmt <- paste0("%.0", digits, "g")
  fmt <- paste0(fmt, " (", fmt, " to ", fmt, ")")
  out <- data.frame(
    coef = sprintf(fmt, x[["beta"]], x[["lower"]], x[["upper"]])
  )
  print.data.frame(out)
}

dat %>%
  select(-id) %>%
  map_dfr(reg_fun, x = vd, data = dat) %>%
  mutate(outcome = names(dat)[-1]) %>% 
  relocate(outcome, .before = "beta")

And to see the return data.frame with the coefficients, lower and upper bounds, end the pipe with a call to as.data.frame.

dat %>%
  select(-id) %>%
  map_dfr(reg_fun, x = vd, data = dat) %>%
  mutate(outcome = names(dat)[-1]) %>% 
  relocate(outcome, .before = "beta") %>%
  as.data.frame()
#>   outcome     beta    lower    upper
#> 1       q 2.128118 2.079911 2.177443
#> 2       r 1.050421 1.048759 1.052085
#> 3       s 1.638222 1.612149 1.664716
#> 4       t 1.012015 1.011534 1.012497

thanks again for these solutions. I attempted to apply this code to my data and encountered the following error message `Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y' In addition: Warning message: In storage.mode(v) <- "double" : NAs introduced by coercion` I am unsure what this means and how to go about resolving it. Any advice is appreciated. — Sandro, Mar 07 '22 at 10:58
Further, where I specify `outcome` in the output table, is it possible to have the outcome labels (i.e. row names) as the column labels from the df we are looping over? — Sandro, Mar 07 '22 at 11:00
@Sandro I am unable to reproduce the error, you will have to post a data set more representative of your real data. — Rui Barradas, Mar 09 '22 at 05:03
thank you, I will have to find a way to produce a MRE from my actual data without breaching the data sharing agreement — Sandro, Mar 10 '22 at 04:57

Loop over columns in a data.frame, extract model beta and SE into a new data.frame

1 Answers1

Edit

Edit 2