1

Reference is made towards Fit model using each predictor columns indiviually store results in dataframe, where a dataframe consists one column of a response variable and several columns of predictor variables. The author wished to fit models for the response variable using each of the predictor variables separately, finally creating a dataframe that contains the coefficients of the model. There's an answer https://stackoverflow.com/a/43959567/14435732 down the original question which interests me (copied below).

require(tibble)
require(dplyr)
require(tidyr)
require(purrr)
require(broom)

df <- iris
response_var <- "Sepal.Length"

vars <- tibble(response=response_var,
               predictor=setdiff(names(df), response_var))

compose_formula <- function(x, y)
  as.formula(paste0("~lm(", y, "~", x, ", data=.)"))

models <- tibble(data=list(df)) %>%
           crossing(vars) %>%
           mutate(fmla = map2(predictor, response, compose_formula),
                  model = map2(data, fmla, ~at_depth(.x, 0, .y)))

models %>% unnest(map(model, tidy))

My question is slightly different: now I have a dataframe with several columns of response variables (say Sepal.Length and Sepal.Width) and several columns of predictor variables (say Petal.Length, Petal.Width and Species) (see my first question from Performing a linear model in R of a single response with a single predictor from a large dataframe and repeat for each column). I got a very helpful answer but it would be perfect if I could have kept the names for response and predictor in the formula of the model object.

P.S: I have tried modifying the codes from https://stackoverflow.com/a/43959567/14435732 but encountered several issues:

  1. When I tried tibble() for vars (now that my response_var has several columns), this happens: 'Error: Tibble columns must have compatible sizes.'
  2. at_depth() is defunct.

Is there a way to get a desired output like below? (copied from https://stackoverflow.com/a/43959567/14435732)

# A tibble: 9 x 7
      response    predictor              term   estimate  std.error statistic
         <chr>        <chr>             <chr>      <dbl>      <dbl>     <dbl>
1 Sepal.Length  Sepal.Width       (Intercept)  6.5262226 0.47889634 13.627631
2 Sepal.Length  Sepal.Width       Sepal.Width -0.2233611 0.15508093 -1.440287
3 Sepal.Length Petal.Length       (Intercept)  4.3066034 0.07838896 54.938900
4 Sepal.Length Petal.Length      Petal.Length  0.4089223 0.01889134 21.646019
5 Sepal.Length  Petal.Width       (Intercept)  4.7776294 0.07293476 65.505517
6 Sepal.Length  Petal.Width       Petal.Width  0.8885803 0.05137355 17.296454
7 Sepal.Length      Species       (Intercept)  5.0060000 0.07280222 68.761639
8 Sepal.Length      Species Speciesversicolor  0.9300000 0.10295789  9.032819
9 Sepal.Length      Species  Speciesvirginica  1.5820000 0.10295789 15.365506
# ... with 1 more variables: p.value <dbl>
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
JujutsuR
  • 81
  • 6

1 Answers1

0

You can use a multi response linear model, where each response is regressed against each predictor separately, so for example:

lm(cbind(Sepal.Length,Sepal.Width) ~ Species,data=iris)

Call:
lm(formula = cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris)

Coefficients:
                   Sepal.Length  Sepal.Width
(Intercept)         5.006         3.428     
Speciesversicolor   0.930        -0.658     
Speciesvirginica    1.582        -0.454 

You get two sets of predictors and tidy does a good job on this:

tidy(lm(cbind(Sepal.Length,Sepal.Width) ~ Species,data=iris))
# A tibble: 6 x 6
  response     term              estimate std.error statistic   p.value
  <chr>        <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 Sepal.Length (Intercept)          5.01     0.0728     68.8  1.13e-113
2 Sepal.Length Speciesversicolor    0.93     0.103       9.03 8.77e- 16
3 Sepal.Length Speciesvirginica     1.58     0.103      15.4  2.21e- 32
4 Sepal.Width  (Intercept)          3.43     0.0480     71.4  5.71e-116
5 Sepal.Width  Speciesversicolor   -0.658    0.0679     -9.69 1.83e- 17
6 Sepal.Width  Speciesvirginica    -0.454    0.0679     -6.68 4.54e- 10  

Now we just need to repeat the above, changing the formula on the RHS , so we can do this by using reformulate, for example:

reformulate(termlabels="Species",response="cbind(Sepal.Length,Sepal.Width)")

cbind(Sepal.Length, Sepal.Width) ~ Species

We write a function to do this, and adding an additional column for the predictor:

library(purrr)
library(dplyr)

tidyMLM = function(iv,dat){

    f = reformulate(termlabels=iv,
    response="cbind(Sepal.Length,Sepal.Width)")
    
    tidy(lm(f,data=dat)) %>% mutate(predictor=iv,.after=response)
}

And use the amazing purrr (gone are the days of lapply..):

predictors = setdiff(colnames(iris),c("Sepal.Length","Sepal.Width"))

map_dfr(predictors,tidyMLM,dat=iris)
# A tibble: 14 x 7
   response    predictor    term          estimate std.error statistic   p.value
   <chr>       <chr>        <chr>            <dbl>     <dbl>     <dbl>     <dbl>
 1 Sepal.Leng… Petal.Length (Intercept)      4.31     0.0784     54.9  2.43e-100
 2 Sepal.Leng… Petal.Length Petal.Length     0.409    0.0189     21.6  1.04e- 47
 3 Sepal.Width Petal.Length (Intercept)      3.45     0.0761     45.4  9.02e- 89
 4 Sepal.Width Petal.Length Petal.Length    -0.106    0.0183     -5.77 4.51e-  8
 5 Sepal.Leng… Petal.Width  (Intercept)      4.78     0.0729     65.5  3.34e-111
 6 Sepal.Leng… Petal.Width  Petal.Width      0.889    0.0514     17.3  2.33e- 37
 7 Sepal.Width Petal.Width  (Intercept)      3.31     0.0621     53.3  1.84e- 98
 8 Sepal.Width Petal.Width  Petal.Width     -0.209    0.0437     -4.79 4.07e-  6
 9 Sepal.Leng… Species      (Intercept)      5.01     0.0728     68.8  1.13e-113
10 Sepal.Leng… Species      Speciesversi…    0.93     0.103       9.03 8.77e- 16
11 Sepal.Leng… Species      Speciesvirgi…    1.58     0.103      15.4  2.21e- 32
12 Sepal.Width Species      (Intercept)      3.43     0.0480     71.4  5.71e-116
13 Sepal.Width Species      Speciesversi…   -0.658    0.0679     -9.69 1.83e- 17
14 Sepal.Width Species      Speciesvirgi…   -0.454    0.0679     -6.68 4.54e- 10

If you have say 80 responses, 120 predictors:

df = data.frame(matrix(rnorm(100*200),ncol=200))
colnames(df) = c(paste0("r",1:80),paste0("p",1:120))
responses = as.matrix(df[,grep("r",colnames(df))])
predictors = grep("p",colnames(df),value=TRUE)

tidyMLM = function(iv,dat){
    
        f = reformulate(termlabels=iv,
        response="responses")
        
        tidy(lm(f,data=dat)) %>% mutate(predictor=iv,.after=response)
    }

map_dfr(predictors,tidyMLM,dat=df)

# A tibble: 19,200 x 7
   response predictor term        estimate std.error statistic p.value
   <chr>    <chr>     <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 r1       p1        (Intercept)  -0.0959    0.0882    -1.09   0.280 
 2 r1       p1        p1            0.0217    0.0879     0.247  0.806 
 3 r2       p1        (Intercept)  -0.0174    0.101     -0.172  0.864 
 4 r2       p1        p1           -0.186     0.101     -1.84   0.0685
 5 r3       p1        (Intercept)   0.0214    0.0947     0.226  0.822 
 6 r3       p1        p1            0.0651    0.0944     0.690  0.492 
 7 r4       p1        (Intercept)   0.0708    0.103      0.686  0.495 
 8 r4       p1        p1           -0.142     0.103     -1.38   0.169 
 9 r5       p1        (Intercept)   0.134     0.101      1.33   0.186 
10 r5       p1        p1            0.0248    0.100      0.247  0.805 
# … with 19,190 more rows

I hope this makes sense now.

StupidWolf
  • 45,075
  • 17
  • 40
  • 72
  • Thanks for your response; perhaps I wasn't very clear on my question: essentially what I need is to perform a linear model on one response variable to one predictor at a time, say I have 80 response variables and 120 predictors I will need to generate 80 x 120 linear models in total. Ideally [https://stackoverflow.com/a/65255383/14435732] Yuriy's solution is what I desire, it would be perfect if in his output I will also have the associated response variables so that I won't have to key them in manually. Could you share your insights? – JujutsuR Dec 12 '20 at 10:07
  • yes and what I am saying is, you can simplify it by fitting all your 80 responses at one go by providing a matrix on the RHS of the lm formula. I don't see at which part am i keying in things so called manually apart from defining your responses – StupidWolf Dec 12 '20 at 10:10
  • I see, apologies on my side as I wasn't sure whether output from the multi response linear model would be different from doing a simple linear modelling. – JujutsuR Dec 12 '20 at 11:19