1

I have a data set of about 500 stocks with their returns. I was wondering if there was a way to generate multiple regression outputs, a 1 to 1 relationship, for a dependent variable to one independent?

For example, here is a simple regression output for one of the stocks MSFT compared to the market. Instead of creating multiple lines like the ones I have below, is there a way to put each of the 500 variables into the Y component of the formula and generate a new output for each? Maybe even go as far as put all this data into one table?

These are the variables that I am trying to group to ease my process: screenshot of variable table

regression_model <- lm(raw_data$MSFT~raw_data$VFINX, raw_data)
summary(regression_model)
neilfws
  • 32,751
  • 5
  • 50
  • 63
  • 1
    this has *got* to be a duplicate ... but too lazy/hurried to go looking for it right now. – Ben Bolker Aug 11 '20 at 01:09
  • Please provide data [in a plain text format](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that users can copy/paste, not as an image. – neilfws Aug 11 '20 at 01:19
  • 1
    here's one example https://stackoverflow.com/questions/61512343/many-regressions-using-tidyverse-and-broom-same-dependent-variable-different-i , https://stackoverflow.com/questions/59337879/most-efficient-way-to-run-regression-models-for-multiple-independent-variables-o . There are lots of variations on this (many dependent, many independent, all combinations of single indep & dep, etc – Ben Bolker Aug 11 '20 at 01:25

1 Answers1

1

One approach to this (which is sure to be a duplicate answer somewhere?) is:

  • reshape the data to long format
  • nest the data to create a column of tibbles containing the independent and dependent values
  • run each regression and store the results in a new column
  • tidy the regression output and unnest the data

Your example data is not usable, so let's use mtcars. Assume we want to predict mpg using each of disp, hp, drat, wt and qsec. First we select, gather and nest:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

mtcars %>% 
  select(mpg, disp, hp, drat, wt, qsec) %>% 
  gather(Var, Val, -mpg) %>% 
  nest(data = c(mpg, Val))

Result:

# A tibble: 5 x 2
  Var   data             
  <chr> <list>           
1 disp  <tibble [32 x 2]>
2 hp    <tibble [32 x 2]>
3 drat  <tibble [32 x 2]>
4 wt    <tibble [32 x 2]>
5 qsec  <tibble [32 x 2]>

Now we can map each row to a regression, and create the column of tidied output:

mtcars %>% 
  select(mpg, disp, hp, drat, wt, qsec) %>% 
  gather(Var, Val, -mpg) %>% 
  nest(data = c(mpg, Val)) %>% 
  mutate(model = map(data, ~lm(mpg ~ Val, data = .)),
         tidied = map(model, tidy))

# A tibble: 5 x 4
  Var   data              model  tidied          
  <chr> <list>            <list> <list>          
1 disp  <tibble [32 x 2]> <lm>   <tibble [2 x 5]>
2 hp    <tibble [32 x 2]> <lm>   <tibble [2 x 5]>
3 drat  <tibble [32 x 2]> <lm>   <tibble [2 x 5]>
4 wt    <tibble [32 x 2]> <lm>   <tibble [2 x 5]>
5 qsec  <tibble [32 x 2]> <lm>   <tibble [2 x 5]>

And finally, select the columns we want and unnest:

mtcars %>% 
  select(mpg, disp, hp, drat, wt, qsec) %>% 
  gather(Var, Val, -mpg) %>% 
  nest(data = c(mpg, Val)) %>% 
  mutate(model = map(data, ~lm(mpg ~ Val, data = .)),
         tidied = map(model, tidy)) %>% 
  select(-model, -data) %>% 
  unnest(cols = c(tidied))

Result:

# A tibble: 10 x 6
   Var   term        estimate std.error statistic  p.value
   <chr> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
 1 disp  (Intercept)  29.6      1.23       24.1   3.58e-21
 2 disp  Val          -0.0412   0.00471    -8.75  9.38e-10
 3 hp    (Intercept)  30.1      1.63       18.4   6.64e-18
 4 hp    Val          -0.0682   0.0101     -6.74  1.79e- 7
 5 drat  (Intercept)  -7.52     5.48       -1.37  1.80e- 1
 6 drat  Val           7.68     1.51        5.10  1.78e- 5
 7 wt    (Intercept)  37.3      1.88       19.9   8.24e-19
 8 wt    Val          -5.34     0.559      -9.56  1.29e-10
 9 qsec  (Intercept)  -5.11    10.0        -0.510 6.14e- 1
10 qsec  Val           1.41     0.559       2.53  1.71e- 2

You could add filters from dplyr::filter(), for example to remove the Intercept rows, or select a p-value threshold.

mtcars %>% 
  select(mpg, disp, hp, drat, wt, qsec) %>% 
  gather(Var, Val, -mpg) %>% 
  nest(data = c(mpg, Val)) %>% 
  mutate(model = map(data, ~lm(mpg ~ Val, data = .)),
         tidied = map(model, tidy)) %>% 
  select(-model, -data) %>% 
  unnest(cols = c(tidied)) %>% 
  filter(p.value < 0.01,
         term != "(Intercept)")

# A tibble: 5 x 6
  Var   term  estimate std.error statistic  p.value
  <chr> <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 disp  Val    -0.0412   0.00471     -8.75 9.38e-10
2 hp    Val    -0.0682   0.0101      -6.74 1.79e- 7
3 drat  Val     7.68     1.51         5.10 1.78e- 5
4 wt    Val    -5.34     0.559       -9.56 1.29e-10
neilfws
  • 32,751
  • 5
  • 50
  • 63
  • I think that I am getting lost at the nest step. Is the data in the 5x2 tibble the data concerning each of the other variables (disp, hp, etc...)? The way I am thinking about it now is that the data for mpg is just "nested" away, hence it isn't shown and you're just giving it a name? Could you also explain the mutate step? I understand that you are defining the model here but the map function has me lost. Thank you for the help, this will save me a lot of time once I fully understand it –  Aug 13 '20 at 00:20
  • The first nest creates a tibble of 32 rows x 2 columns, where the columns are `mpg` and `Val`. `Val` is the value of the corresponding variable for that row (`disp`, `hp` etc.). So in effect, we are subsetting the data by each independent variable of interest. The 2 x 5 tibble is the output of the `tidy` function - it is a summary of the regression output. I would suggest running `tidy` on a single `lm` object to see how it works. `map`: basically the first one maps each row in the `data` column to the `lm` function. The second maps each row in the `model` column to the `tidy` function. – neilfws Aug 13 '20 at 00:31
  • There are quite a lot of methods to digest here, which is why I broke it into steps. Take your time and you may benefit from good tutorials or guides to this topic. It's covered in the [dplyr + broom vignette](https://cran.r-project.org/web/packages/broom/vignettes/broom_and_dplyr.html), also [this blog post](https://towardsdatascience.com/functional-programming-in-r-with-purrr-469e597d0229) and [this one](https://drsimonj.svbtle.com/running-a-model-on-separate-groups). – neilfws Aug 13 '20 at 00:39
  • I can't thank you enough, this is really great. I just need to visualize what it is doing I think to better understand it. So that tibble (the 5 x 2) if we break it down has 32 entries for the var right? but we are subsetting it? there are 32 entries and then there are 32 data observations for mpg and then the 32 values corresponding with disp? then it goes to hp where there are 32 entries of that and then 32 entries of mpg and then the corresponding values for value. essentially you're replacing the names of the cars with the variables and keep mpg? –  Aug 13 '20 at 00:57
  • Yes, you are understanding the 32 x 2 tibbles correctly. The 2 x 5 tibbles in the `tidied` column contain what you see when `unnest` is run: the columns `term`, `estimate`, `std.error`, `statistic` and `p.value`, and the rows `(Intercept)` and `Val` - where `Val` is the coefficient from `lm` for the variable of interest. Think of `data` as containing the data for each dependent/independent combination, `model` as containing each lm and `tidied` as containing each lm summary. – neilfws Aug 13 '20 at 01:11
  • Very interesting, I just watched a quick video on map. That's very very interesting, it's almost transposing the data into a vector so that we can make a mathematical function for every column instead of having to do it for every list. Just a couple more questions, I've read that mutate keeps old variables and creates the new ones but is the tilde in front of lm necessary? And why did you put data = . at the end? Is it possible to keep the model instead of selecting everything but them? Finally, how did you learn R? I am currently taking a data analytics course but want to know more. –  Aug 13 '20 at 01:28
  • Try it without the tilde and note the error message. It's required for `lm` to recognise that the arguments are column names. And "data = ." is shorthand for "data = the thing we are mapping" - i.e. the tibble in the `data` column. Yes, you can keep the `model` column as a list column in the final output. I learned R slowly and painfully beginning in about 2001, so you should not take my advice on that topic :) – neilfws Aug 13 '20 at 01:36
  • Hahahah I can't tell you how much I appreciate this, when I first saw your response I was like, "There's no way I can do this lol" but I really think this is awesome. I'll have to read more material on this stuff –  Aug 13 '20 at 01:46
  • I have come back more knowledgable than the last time. I am still having the upmost trouble understanding why that tilde needs to be in front of lm. I understand you're reasoning and I trust you but where is the documentation that ~ does what you say? I've been searching relentlessly to justify it in my mind but I can't. I understand why it DOESN'T work without it, but I don't understand why it does with it. –  Aug 28 '20 at 06:37
  • [Answered here I think](https://stackoverflow.com/questions/44834446/what-is-meaning-of-first-tilde-in-purrrmap) – neilfws Aug 28 '20 at 10:52