
I have a table with 3,000+ rows and 10+ variables. I am trying to run a linear regression using one variable as the predictor and another as the response for 300 different groups. I need the slope, p-value, and r-squared for each of these regressions. Doing each regression individually and recording the summary statistics would take hours if not days.

I have used the following package to get the intercept and slope for each group, but I do not know how to also get the corresponding p-value and r-squared for each group:

library(lme4)
groupreg <- lmList(logpop ~ avgp | id, data = data)
groupreg

A sample of the output is below, where "Adams #" is the id value. NAs exist because not all groups have multiple points to fit a line:

Coefficients:
                (Intercept)          avgp
Adams 6           4.0073332            NA
Adams 7           6.5177389 -7.342443e+00
Adams 8           4.7449321            NA
Adams 9                  NA            NA

This table does not include any significance statistics, however. I still need the p-value and r-squared statistic. If there is code to do it all in one go for all groups, or code to just pull the remaining values, that would be helpful.

Is there also a way to exponentiate the slope output for all groups? My outcome was log-transformed.

Thank you all!!

mishelia
  • Sure would be nice to see some representative data. https://stackoverflow.com/help/how-to-ask, https://stackoverflow.com/questions/5963269/, and https://stackoverflow.com/help/mcve – r2evans Jul 28 '18 at 04:03
  • If you want the p-value and r-squared (adjusted?) for all of your regressions, it isn't going to address the problem that you need to run each regression individually to be able to get them (ergo it will still take *"hours if not days"*). If you want a mathematical proof of why you cannot get these two summary statistics without actually running a full factorial of your models, perhaps you can ask the pedagogy of it at [Cross Validated](http://stats.stackexchange.com/). (Short-answer: you can't.) – r2evans Jul 28 '18 at 04:10
  • Just use summary: https://stat.ethz.ch/R-manual/R-devel/library/nlme/html/summary.lmList.html – Roland Jul 28 '18 at 06:32
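
Following Roland's pointer, here is a minimal sketch of what summary() on the lmList fit can give, assuming lme4's lmList4 summary mirrors nlme's (a 3-D $coefficients array of group × statistic × term) and skipping groups whose fit failed:

# Per-group slope table: estimate, std. error, t value, Pr(>|t|)
summary(groupreg)$coefficients[, , "avgp"]

# r-squared per group: each element of the lmList is an ordinary lm fit
sapply(groupreg, function(m) if (is.null(m)) NA else summary(m)$r.squared)

# exponentiated slopes, since the outcome is log-transformed
exp(coef(groupreg)[, "avgp"])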

3 Answers


I think the easiest answer is still missing. You can use a combination of nesting and mapping. I'll show you how it works for linear regression. You should be able to apply the same principle to models from the lme4 package.

Let's create a toy data set where we've measured the IQ score for three different groups at two different points in time.

library(tidyverse)
library(broom)

df <- tibble(
  id = seq_len(90),
  IQ = rnorm(90, 100, 15),
  group = rep(c("A", "B", "C"), each = 30),
  time = rep(c("T1", "T2"), 45)
)

If we want to build a regression model for each group, investigating the relation between the IQ score and the time point, we only need a few lines of code.

df %>% 
  nest(data = -group) %>% 
  mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
         results = map(fit, glance)) %>% 
  unnest(results) %>% 
  select(group, r.squared, p.value)

Which will return

# A tibble: 3 x 3
  group r.squared p.value
  <chr>     <dbl>   <dbl>
1 A       0.0141    0.532
2 B       0.0681    0.164
3 C       0.00432   0.730

where nest(data = -group) creates a nested tibble for each group, containing the corresponding variables id, IQ, and time. With mutate() you then add a column fit, applying a regression model to each group, and a column results holding the output of glance(), which we unnest() shortly after to access the returned values properly. In the last step we select() the three values of interest.

To get the slope you need to call tidy() in addition. Maybe it's possible to shorten the code somehow, but one solution would be

df %>% 
  nest(data = -group) %>% 
  mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
         results1 = map(fit, glance),
         # keep only term and estimate from tidy() so its p.value and
         # statistic columns don't clash with the ones glance() returns
         results2 = map(fit, ~ select(tidy(.), term, estimate))) %>% 
  unnest(results1) %>% 
  unnest(results2) %>% 
  select(group, term, estimate, r.squared, p.value) %>% 
  mutate(estimate = exp(estimate))

To exponentiate the slope, you just add another mutate() statement, as in the last line above (note that this exponentiates the intercept as well). Finally it returns

# A tibble: 6 x 5
  group term        estimate r.squared p.value
  <chr> <chr>          <dbl>     <dbl>   <dbl>
1 A     (Intercept) 3.34e+46   0.0141    0.532
2 A     timeT2      3.31e- 2   0.0141    0.532
3 B     (Intercept) 1.17e+47   0.0681    0.164
4 B     timeT2      1.34e- 3   0.0681    0.164
5 C     (Intercept) 8.68e+43   0.00432   0.730
6 C     timeT2      1.25e- 1   0.00432   0.730

Note that the estimates are exponentiated already. Without the exponentiation you can double-check the slope and p-value with base R by calling

summary(lm(IQ ~ time, data = filter(df, group == "A")))

If you work with more complex models (lme4), there is a package called lmerTest which offers wrapper functions for lme4 that return p-values (at least for the mixed models I have worked with).
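
As a minimal sketch on the toy data from above (the random intercept per group is purely illustrative; three groups is far too few for a serious mixed model):

library(lmerTest)  # masks lme4::lmer() with a version whose summary() adds p-values

fit <- lmer(IQ ~ time + (1 | group), data = df)
summary(fit)$coefficients  # now includes df and Pr(>|t|) columns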

A word of warning about using glance() for lme4 models: the maintainers of the broom package are trying a new concept in which the summary statistics for such models are outsourced to the package developers responsible for each type of model.

j3ypi
  • This makes the most sense to me. Everything runs fine, but when I create a subset of my data defined by some value of my group variable (say group "A"), and I run an lm function on that subset, my slope estimate comes out different than with this code. The r.squared and p.value are the same, however. Any idea what might be happening? Here is my code for the subset (I don't know how to format code in comments): `>df = lm(var2~var1, data=subset(data, id=="A")) >summary(df)` – mishelia Jul 28 '18 at 18:34
  • @mishelia I accidentally chose the F-statistic instead of the estimated slope. I updated the answer accordingly but it loses a bit of elegance, since we need to calculate two summary statistics. – j3ypi Jul 29 '18 at 00:07
  • @JPNolte Thank you for this useful code, it helps a lot. I want to make sure of one thing: suppose I have one dependent variable and 10 independent variables. If I want to regress the dependent variable against each of the independent variables by group, how could I modify this code? Thanks. – R starter May 11 '19 at 18:02
  • I'm not quite sure what you mean. I suggest you ask a new question with a reproducible example. – j3ypi May 12 '19 at 06:37

If I am understanding your question correctly, you want to run multiple regressions over lots of groups. Here is an example of how to do so with the mtcars data.

library(dplyr)
# note: funs() is deprecated in dplyr >= 0.8 in favour of list()/across()
mtcars %>% group_by(cyl) %>% 
    summarise_at(vars(disp:wt), funs(
        # the 2x4 coefficient matrix is stored column-major, so [[1]]/[[2]]
        # are the intercept/slope estimates and [[8]] is the slope's p-value
        r.sqr = summary(lm(mpg~.))$r.squared,
        intercept = summary(lm(mpg~.))$coefficients[[1]],
        slope = summary(lm(mpg~.))$coefficients[[2]],
        p.value = summary(lm(mpg~.))$coefficients[[8]]
    ))

This will run a regression per group per variable and extract the info you asked for. If your formula is always the same, you could simplify as follows.

mtcars %>% group_by(cyl) %>% 
    summarise(
        r.sqr = summary(lm(mpg~wt))$r.squared,
        intercept = summary(lm(mpg~wt))$coefficients[[1]],
        slope = summary(lm(mpg~wt))$coefficients[[2]],
        p.value = summary(lm(mpg~wt))$coefficients[[8]]
    )

This actually runs the regression 4 times (once per value of interest). If that takes too long for your real data, you could try this:

df <- mtcars %>% group_by(cyl) %>% summarise(model = list(summary(lm(mpg~wt))))

which simply runs the model once per group and then extracts the info you want from the stored summaries. The problem is that extracting values this way can be a pain:

df$model[[1]]$coefficients[[1]]
[1] 39.5712
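
One way around the bracket-chasing is to pull each statistic out of the stored summaries with sapply. A base-R sketch (row 2 of the coefficient matrix is the wt term; columns 1 and 4 hold its estimate and p-value):

# Tabulate the statistics from the stored summary objects
data.frame(
  cyl     = df$cyl,
  r.sqr   = sapply(df$model, function(s) s$r.squared),
  slope   = sapply(df$model, function(s) s$coefficients[2, 1]),
  p.value = sapply(df$model, function(s) s$coefficients[2, 4])
)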
AndS.
  • Running the same `lm(...)` four times to get four different statistics seems a bit ... inefficient. Can't you `nest() %>% mutate(mdl=map(dat,~lm(mpg~wt,data=.x)),summ=map(mdl,~summary(.x)),rsq=map(summ,~.x[["r.squared"]]),pv=map(summ,~.x$coefficients[2,4]))`? – r2evans Jul 28 '18 at 04:36
  • I apologize in advance because I'm just starting to use r, but what does 'summarise_at(vars(disp:wt), funs(' do? I tried running the second function and received the following error: 'Error in summarise_impl(.data, dots) : Evaluation error: 0 (non-NA) cases.' I also don't understand what this means. – mishelia Jul 28 '18 at 05:08
  • We do not have your data, so we cannot know for sure. When I do `mtcars$mpg<-NA;lm(mpg~wt,data=mtcars)` I get the same error, so do you know enough about your data to know if you are regressing on invariant/null data? – r2evans Jul 28 '18 at 06:03
  • @r2evans you are correct that the four lm(..) calls are probably inefficient. My new goal for the morning is to investigate how to incorporate map(...) into my own common workflows. Always gotta expand the toolbox. Thanks for the tip. – AndS. Jul 28 '18 at 13:15
  • Though I think the `broom` package can provide some utility, I find `tidyr::nest` and `purrr::map*` functions to be sufficient for dealing with this. Once you get used to encompassing any "thing" in a frame cell, it can become quite natural (for me, at least) to keep the model in one column, summary in another column, and derive statistics from those two cells. It's analogous (outside of frames) to `lst_of_summaries <- lapply(lst_of_models, summary)`. – r2evans Jul 28 '18 at 16:22

While the code given by AndS. will work, it runs the lm function four times for each group, which makes it a bit inefficient. You can use the following instead. I will try to break it into simpler steps:

Assuming your data frame (df) has three variables, "Group", "Dep", and "Indep":

# Getting the unique list of groups
groups <- unique(df$Group)

# Creating a list to collect the model summary of each model
model_summaries <- list()

# Running the models, one per group
for (i in seq_along(groups)) {
  model <- lm(Dep ~ Indep, data = df[df$Group == groups[i], c("Dep", "Indep")])
  model_summaries[[i]] <- summary(model)
}

Each model summary contains, among other elements, r.squared and coefficients (which includes the p-values and the intercept too).
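
To then collect the slope (exponentiated, since the question's outcome is log-transformed), p-value, and r-squared into one table, here is a sketch that assumes every group has at least two distinct points (otherwise the slope row is missing from the coefficient matrix):

# Collect the statistics from each stored summary; row 2 of the
# coefficient matrix is the slope, column 4 its p-value
results <- data.frame(
  Group     = groups,
  slope     = sapply(model_summaries, function(s) s$coefficients[2, 1]),
  exp.slope = sapply(model_summaries, function(s) exp(s$coefficients[2, 1])),
  p.value   = sapply(model_summaries, function(s) s$coefficients[2, 4]),
  r.squared = sapply(model_summaries, function(s) s$r.squared)
)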

Let me know if this helps.

pyr_viz
  • Expanding the list with each iteration is bad practice and slow. You should use `vector("list", length(groups))` when you create the list instead. – j3ypi Jul 28 '18 at 12:52
  • If you look, this is actually the third option that I gave. One regression line and a list of regression results minus the loop. – AndS. Jul 28 '18 at 13:11