How can I run all-subsets regression and get the p-values of the variables in each linear regression as a data frame using dplyr?

Question

How can I evaluate the string in each row using dplyr::mutate?

I am a two-month newbie to R. I am practicing the tidyverse to manipulate data and run statistics.

I am trying to run multiple linear regressions and get the p-value of each variable in each regression.

Here is a reproducible sample:

library(tidyverse)

df <-
  tibble(serialNO = seq(1,10,1),
         lactate = c(1.3, 1.6, 2.6, 3.5, 1.2, 1.1, 3.6, 3, 1.9, 5.3),
         BMI = c(20, 27, 23, 25, 23, 23, 20, 24, 19, 23),
         Afib = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0),
         LVEF = c(65, 68, 61, 58, 57, 58, 25, 59, 66, 58))

# A tibble: 10 x 5
   serialNO lactate   BMI  Afib  LVEF
      <dbl>   <dbl> <dbl> <dbl> <dbl>
 1        1     1.3    20     0    65
 2        2     1.6    27     0    68
 3        3     2.6    23     1    61
 4        4     3.5    25     0    58
 5        5     1.2    23     0    57
 6        6     1.1    23     0    58
 7        7     3.6    20     1    25
 8        8     3      24     0    59
 9        9     1.9    19     0    66
10       10     5.3    23     0    58

The code for each multiple linear regression is stored as a string, one per row, like this:

reg_com <- c("lm(lactate~sex+BMI+Afib, data=df)",
             "lm(lactate~sex+BMI+LVEF, data=df)",
             "lm(lactate~sex+Afib+LVEF, data=df)",
             "lm(lactate~BMI+Afib+LVEF, data=df)")

# A tibble: 4 x 1
  reg                               
  <chr>                             
1 lm(lactate~sex+BMI+Afib, data=df) 
2 lm(lactate~sex+BMI+LVEF, data=df) 
3 lm(lactate~sex+Afib+LVEF, data=df)
4 lm(lactate~BMI+Afib+LVEF, data=df)

The result I want looks like this:

# A tibble: 4 x 5
  reg                                sex   BMI   Afib  LVEF 
  <chr>                              <chr> <chr> <chr> <chr>
1 lm(lactate~sex+BMI+Afib, data=df)  p     p     p     NA   
2 lm(lactate~sex+BMI+LVEF, data=df)  p     p     NA    p    
3 lm(lactate~sex+Afib+LVEF, data=df) p     NA    p     p    
4 lm(lactate~BMI+Afib+LVEF, data=df) NA    p     p     p

Each p in the tibble stands for the p-value of that variable in the corresponding regression.

After spending two full days on this, I tried using a for loop, but I get an error message:

reg_sum <- tibble(reg = as.character())
  
for(i in 1:length(reg_com)) {
  a <-
    df %>% 
    print(eval(parse(text=paste0(",reg_com[i],")))) %>%
    tidy %>%
    select(term, p.value) %>%
    column_to_rownames(var = "term") %>% # prepare for transpose
    t %>% 
    as_tibble %>%
    mutate(reg = reg_com[i])
  
  reg_sum <- full_join(reg_sum, a)
}

error: C stack usage  15923360 is too close to the limit

I need this because I have to run more than 10,000 combinations of linear regressions.

I want to do it using dplyr if possible. (It's so cool!)

Please help me!

Welcome to SO. What have you tried so far and where have you looked for guidance? — Peter, May 13 '20 at 11:43
Can you update a minimum reproducible sample? Refer https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Mohanasundaram, May 13 '20 at 11:48
@Peter I updated my latest code. Before that, I tried using map(), which worked so far doing simple linear regressions. — bashmet, May 13 '20 at 12:27
@Mohanasundaram I updated reproducible sample. Thanks for advice! — bashmet, May 13 '20 at 12:28

score 0 · Answer 1 · answered Oct 03 '20 at 18:45

Your example dataset is missing the sex variable that the formulas use, so I add one here:

df <- tibble(serialNO = seq(1,10,1),
         lactate = c(1.3, 1.6, 2.6, 3.5, 1.2, 1.1, 3.6, 3, 1.9, 5.3),
         BMI = c(20, 27, 23, 25, 23, 23, 20, 24, 19, 23),
         Afib = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0),
         LVEF = c(65, 68, 61, 58, 57, 58, 25, 59, 66, 58),
         sex = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1))

We can set up the formulas as plain strings, without the surrounding lm() call:

reg_com <- c("lactate~sex+BMI+Afib",
             "lactate~sex+BMI+LVEF",
             "lactate~sex+Afib+LVEF",
             "lactate~BMI+Afib+LVEF")
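Since the question mentions more than 10k combinations, typing the formulas by hand will not scale. Here is a minimal sketch of generating them instead, assuming base R's combn(); the names make_formulas, predictors, reg_k3 and reg_all are hypothetical and only used for illustration:

```r
# Hypothetical helper: build one formula string per size-k subset of predictors.
make_formulas <- function(response, predictors, k) {
  # combn() calls FUN on each combination and simplifies to a character vector
  combn(predictors, k,
        FUN = function(vars) paste0(response, "~", paste(vars, collapse = "+")))
}

predictors <- c("sex", "BMI", "Afib", "LVEF")

# All three-predictor models: the same four strings as reg_com above
reg_k3 <- make_formulas("lactate", predictors, 3)

# All non-empty subsets (sizes 1 to 4), i.e. 4 + 6 + 4 + 1 = 15 formulas
reg_all <- unlist(lapply(seq_along(predictors),
                         function(k) make_formulas("lactate", predictors, k)))
```

Either vector could be used in place of reg_com in the steps that follow.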

Then we iterate over the formulas with map(), fit each model with lm(), and use tidy() from broom to get the coefficients as a data frame:

library(dplyr)
library(purrr)
library(tidyr)
library(broom)

tibble(reg = reg_com) %>%
  mutate(results = map(reg, ~ tidy(lm(., data = df)))) %>%
  unnest(results)

# A tibble: 16 x 6
   reg                   term        estimate std.error statistic p.value
   <chr>                 <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 lactate~sex+BMI+Afib  (Intercept)   7.09      5.47     1.30      0.242
 2 lactate~sex+BMI+Afib  sex           2.45      1.36     1.80      0.122
 3 lactate~sex+BMI+Afib  BMI          -0.272     0.262   -1.04      0.339
 4 lactate~sex+BMI+Afib  Afib          1.86      1.20     1.55      0.172
 5 lactate~sex+BMI+LVEF  (Intercept)   7.38      5.79     1.28      0.249
 6 lactate~sex+BMI+LVEF  sex           1.46      1.26     1.16      0.291
 7 lactate~sex+BMI+LVEF  BMI          -0.120     0.278   -0.430     0.682
 8 lactate~sex+BMI+LVEF  LVEF         -0.0502    0.0397  -1.27      0.252
 9 lactate~sex+Afib+LVEF (Intercept)   3.76      3.12     1.21      0.274
10 lactate~sex+Afib+LVEF sex           1.34      0.987    1.36      0.223
11 lactate~sex+Afib+LVEF Afib          0.913     1.55     0.589     0.577
12 lactate~sex+Afib+LVEF LVEF         -0.0366    0.0483  -0.759     0.477
13 lactate~BMI+Afib+LVEF (Intercept)   2.93      5.41     0.541     0.608
14 lactate~BMI+Afib+LVEF BMI           0.109     0.216    0.505     0.632
15 lactate~BMI+Afib+LVEF Afib         -0.0120    1.54    -0.00778   0.994
16 lactate~BMI+Afib+LVEF LVEF         -0.0504    0.0549  -0.918     0.394

At this stage we are almost there. We just need to drop the intercept rows, keep only the p-values, and pivot into a wide format. Here is the complete code:

tibble(reg = reg_com) %>%
  mutate(results = map(reg, ~ tidy(lm(., data = df)))) %>%
  unnest(results) %>%
  filter(term != "(Intercept)") %>%
  select(reg, term, p.value) %>%
  pivot_wider(values_from = p.value, names_from = term)

# A tibble: 4 x 5
  reg                      sex    BMI   Afib   LVEF
  <chr>                  <dbl>  <dbl>  <dbl>  <dbl>
1 lactate~sex+BMI+Afib   0.122  0.339  0.172 NA    
2 lactate~sex+BMI+LVEF   0.291  0.682 NA      0.252
3 lactate~sex+Afib+LVEF  0.223 NA      0.577  0.477
4 lactate~BMI+Afib+LVEF NA      0.632  0.994  0.394
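If the full lm() calls really must stay stored as strings, as in the question, the same pipeline can evaluate them row by row. The following is only a sketch under that assumption; fit_from_string, reg_calls and p_wide are hypothetical names, and eval(parse()) will run whatever code the strings contain, so storing just the formulas as above is usually the safer choice:

```r
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

# Same data as above, including the added sex column
df <- tibble(serialNO = seq(1, 10, 1),
             lactate = c(1.3, 1.6, 2.6, 3.5, 1.2, 1.1, 3.6, 3, 1.9, 5.3),
             BMI = c(20, 27, 23, 25, 23, 23, 20, 24, 19, 23),
             Afib = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0),
             LVEF = c(65, 68, 61, 58, 57, 58, 25, 59, 66, 58),
             sex = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1))

# Full lm() calls stored as strings, as in the question
reg_calls <- c("lm(lactate~sex+BMI+Afib, data=df)",
               "lm(lactate~sex+BMI+LVEF, data=df)",
               "lm(lactate~sex+Afib+LVEF, data=df)",
               "lm(lactate~BMI+Afib+LVEF, data=df)")

# Hypothetical helper: parse one string into a call and evaluate it.
# It is defined at top level so that `df` is looked up where it was created.
fit_from_string <- function(call_string) eval(parse(text = call_string))

p_wide <- tibble(reg = reg_calls) %>%
  mutate(results = map(reg, ~ tidy(fit_from_string(.x)))) %>%  # one tidy table per row
  unnest(results) %>%
  filter(term != "(Intercept)") %>%
  select(reg, term, p.value) %>%
  pivot_wider(names_from = term, values_from = p.value)

p_wide
```

The result should have the same shape as the table above, with the full call string in the reg column instead of the bare formula.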
