Performing a linear model in R of a single response with a single predictor from a large dataframe and repeat for each column

Question

It might not be very clear from the title but what I wish to do is:

I have a dataframe df with, say, 200 columns and the first 80 columns are response variables (y1, y2, y3, ...) and the rest of 120 are predictors (x1, x2, x3, ...).
I wish to compute a linear model for each pair – lm(yi ~ xi, data = df).
Many problems and solutions I have looked through online have a either a fixed response vs many predictors or the other way around, using lapply() and its related functions.

Could anyone who is familiar with it point me to the right step?

Yuriy Saraykin · Accepted Answer · 2020-12-11T17:49:50.283

use tidyverse

library(tidyverse)
library(broom)
df <- mtcars
y <- names(df)[1:3]
x <- names(df)[4:7]

result <- expand_grid(x, y) %>% 
  rowwise() %>% 
  mutate(frm = list(reformulate(x, y)),
         model = list(lm(frm, data = df))) 

result$model <- purrr::set_names(result$model, nm = paste0(result$y, " ~ ", result$x))

result$model[1:2]
#> $`mpg ~ hp`
#> 
#> Call:
#> lm(formula = frm, data = df)
#> 
#> Coefficients:
#> (Intercept)           hp  
#>    30.09886     -0.06823  
#> 
#> 
#> $`cyl ~ hp`
#> 
#> Call:
#> lm(formula = frm, data = df)
#> 
#> Coefficients:
#> (Intercept)           hp  
#>     3.00680      0.02168

map_df(result$model, tidy)
#> # A tibble: 24 x 5
#>    term        estimate std.error statistic  p.value
#>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#>  1 (Intercept)  30.1      1.63       18.4   6.64e-18
#>  2 hp           -0.0682   0.0101     -6.74  1.79e- 7
#>  3 (Intercept)   3.01     0.425       7.07  7.41e- 8
#>  4 hp            0.0217   0.00264     8.23  3.48e- 9
#>  5 (Intercept)  21.0     32.6         0.644 5.25e- 1
#>  6 hp            1.43     0.202       7.08  7.14e- 8
#>  7 (Intercept)  -7.52     5.48       -1.37  1.80e- 1
#>  8 drat          7.68     1.51        5.10  1.78e- 5
#>  9 (Intercept)  14.6      1.58        9.22  2.93e-10
#> 10 drat         -2.34     0.436      -5.37  8.24e- 6
#> # ... with 14 more rows

map_df(result$model, glance)
#> # A tibble: 12 x 12
#>    r.squared adj.r.squared  sigma statistic  p.value    df logLik   AIC   BIC
#>        <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
#>  1     0.602         0.589   3.86     45.5  1.79e- 7     1  -87.6 181.  186. 
#>  2     0.693         0.683   1.01     67.7  3.48e- 9     1  -44.6  95.1  99.5
#>  3     0.626         0.613  77.1      50.1  7.14e- 8     1 -183.  373.  377. 
#>  4     0.464         0.446   4.49     26.0  1.78e- 5     1  -92.4 191.  195. 
#>  5     0.490         0.473   1.30     28.8  8.24e- 6     1  -52.7 111.  116. 
#>  6     0.504         0.488  88.7      30.5  5.28e- 6     1 -188.  382.  386. 
#>  7     0.753         0.745   3.05     91.4  1.29e-10     1  -80.0 166.  170. 
#>  8     0.612         0.599   1.13     47.4  1.22e- 7     1  -48.3 103.  107. 
#>  9     0.789         0.781  57.9     112.   1.22e-11     1 -174.  355.  359. 
#> 10     0.175         0.148   5.56      6.38 1.71e- 2     1  -99.3 205.  209. 
#> 11     0.350         0.328   1.46     16.1  3.66e- 4     1  -56.6 119.  124. 
#> 12     0.188         0.161 114.        6.95 1.31e- 2     1 -196.  398.  402. 
#> # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

^{Created on 2020-12-11 by the reprex package (v0.3.0)}

Thanks! It's very helpful. I wonder if it is possible to generate the output in 2 separate matrices (80 x 120), one for the coefficient term and the other its associated p-values. — JujutsuR, Dec 11 '20 at 17:20
I was glad to help. Try it this way `map_df(result$model, tidy)` or `map_df(result$model, glance)` — Yuriy Saraykin, Dec 11 '20 at 17:40
Both returned as 'Error: No tidy method for objects of class summary.lm' — JujutsuR, Dec 11 '20 at 17:45
I have the library(broom) ready, so not sure what went wrong — JujutsuR, Dec 11 '20 at 17:47
Ah I have found my mistake. I have added summary(lm(...)) in the function earlier — JujutsuR, Dec 11 '20 at 17:52
Hi Yuriy, with the 'map_df()' function the response variable annotation is lost (e.g. mpg, cyl etc.). Is there a way to associate it back your (24 x 5) tibble and so I could use 'pivot_wider()' to reshape the data into the wide format I want? — JujutsuR, Dec 11 '20 at 19:23
solved. In case anyone is looking through the section, first pull out the response variables (# of rows = R) as a column and use `cbind()` to add that column to the output result (of different lengths, i.e. different # of rows, say W) and R would recycle the response variables over and over, as long as the W is a multiple of R. — JujutsuR, Dec 12 '20 at 12:38

Performing a linear model in R of a single response with a single predictor from a large dataframe and repeat for each column

1 Answers1

Linked