1

I was wondering how to perform multiple independent linear regressions on a combination of levels for two or more factor variables.

Let's say our dataset has one continuous dependent variable, two factor independent variables, and one continuous independent variable.

Then let's say our regression formula in R is this:

model <- lm(weight ~ city + diet + height)

Or, to write it in pseudocode, I'm trying to do this:

lm(weight ~ height) %>% group by city
lm(weight ~ height) %>% group by diet
lm(weight ~ height) %>% group by city & diet

I know that we could run a linear regression for each city and diet one by one, but do you know of a way we could create a loop so that we do an independent regression for each city and diet in our dataset?

To illustrate this better, I've made a fake dataset (shown in the image below) and listed the three types of output I would want. However, I don't want to do them manually one by one; I'd rather use a loop.

Does anyone know how to do this in R?

[Image: example fake dataset and the three desired per-group regression outputs]

jay.sf
user1992348
  • Whilst the answers provide solutions to the question as asked, I think it is also worth pointing out that, from a purely statistical point of view, subset analyses of the type requested are almost always less efficient than an analysis of the full dataset that includes the subsetting variables as independent terms in the fitted model. – Limey Mar 06 '21 at 10:39
  • @Limey Thanks, but what do you mean by efficient in this case? From a purely statistical point of view are you saying this approach is less accurate or just slower/more expensive to compute? – user1992348 Mar 06 '21 at 19:53
  • No. I'm claiming that the standard errors of estimates and predictions from the pooled model will generally be smaller than for the corresponding quantities from the sub-group analyses. (Though execution time may well be less for the pooled analysis than n subset analyses as well.) – Limey Mar 07 '21 at 08:07

3 Answers

5

We can define the model specifications in a list and then use lapply() over the list of desired models.

Code

models <- list("m1" = c("weight", "height"),
               "m2" = c("weight", "height", "city"),
               "m3" = c("weight", "height", "diet"),
               "m4" = c("weight", "height", "diet", "city"))

lapply(models, function(x){
  # regress weight on all other columns selected in x ('.' expands to them)
  lm(weight ~ ., data = df[, x])
})

# $m1
# 
# Call:
# lm(formula = weight ~ ., data = df[, x])
# 
# Coefficients:
# (Intercept)       height  
#     -0.2970       0.1219  
#
#
# $m2
#
# Call:
# lm(formula = weight ~ ., data = df[, x])
#
# Coefficients:
# (Intercept)       height  cityHouston  
#     -0.3705       0.1259       0.1205  
#
#
# $m3
#
# Call:
# lm(formula = weight ~ ., data = df[, x])
# 
# Coefficients:
#    (Intercept)          height       dietVegan  dietVegetarian  
#        -0.1905          0.1270         -0.1288         -0.1757  
#
#
# $m4
#
# Call:
# lm(formula = weight ~ ., data = df[, x])
# 
# Coefficients:
#  (Intercept)          height       dietVegan  dietVegetarian     cityHouston  
#        -0.2615          0.1310         -0.1417         -0.1663          0.1197  

Data

# Note: no seed is set, so the coefficients shown above will vary from run to run
df <- data.frame("weight" = rnorm(100),
                 "height" = rexp(100),
                 "diet" = as.factor(sample(c("Vegan", "Vegetarian", "Meat"), replace = TRUE, 100)),
                 "city" = as.factor(sample(c("Houston", "Chicago"), replace = TRUE, 100)))

fabla
  • Thanks, but this is not quite what I was looking for. It's not that I want to see the coefficient for Houston when the intercept represents Chicago. It's that I want to see the coefficient for height when the city is Houston or when the city is Chicago... – user1992348 Mar 06 '21 at 21:18
3

First define a small regfun() that computes the desired summary statistics. Then apply it group-wise using by(). For the combination of the two grouping variables we could paste the columns together, or simply use the interaction operator : for factors.

# Slope on h: estimate (column 1) and p-value (column 4) of the coefficient table
regfun <- function(x) summary(lm(w ~ h, x))$coe[2, c(1, 4)]

do.call(rbind, by(d, d$city, regfun))
#     Estimate  Pr(>|t|)
# a -0.1879530 0.4374580
# b -0.2143780 0.4674864
# c -0.2866948 0.5131854

do.call(rbind, by(d, d$diet, regfun))
#     Estimate  Pr(>|t|)
# y -0.1997162 0.3412652
# z -0.3512349 0.4312766

# do.call(rbind, by(d, Reduce(paste, d[1:2]), regfun))
with(d, do.call(rbind, by(d, city:diet, regfun)))  ## credits to @G.Grothendieck
#       Estimate  Pr(>|t|)
# a y -0.2591764 0.5576043
# a z -0.1543536 0.8158689
# b y -0.1966501 0.7485405
# b z -0.4354839 0.7461538
# c y -0.5000000 0.3333333
# c z -1.0671642 0.7221495

Edit

If we have an unbalanced panel, i.e. city:diet gives "impossible" combinations that aren't actually in the data, we have to code this slightly differently. You can think of by() as a combination of first split() and then lapply(), so let's do that. Because the empty combinations will throw errors, we can use tryCatch() to substitute a row of NAs for them.

s <- with(d2, split(d2, city:diet))  # every city:diet combination, including empty ones
do.call(rbind, lapply(s, function(x) 
  tryCatch(regfun(x), 
           error=function(e) cbind.data.frame(Estimate=NA, `Pr(>|t|)`=NA))))
#       Estimate  Pr(>|t|)
# a:y -0.2591764 0.5576043
# a:z         NA        NA
# b:y  5.2500000       NaN
# b:z         NA        NA
# c:y -0.5000000 0.3333333
# c:z  9.5000000       NaN
# d:y         NA        NA
# d:z  1.4285714       NaN
# e:y         NA        NA
# e:z -7.0000000       NaN
# f:y         NA        NA
# f:z  2.0000000       NaN
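An alternative to catching the error is to drop the empty or very small groups before fitting; a minimal sketch, assuming each remaining group still has variation in h:

# Keep only city:diet groups with at least 3 rows, then apply regfun() directly
s <- with(d2, split(d2, city:diet))
s_ok <- Filter(function(x) nrow(x) >= 3, s)
do.call(rbind, lapply(s_ok, regfun))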

Data:

d <- structure(list(city = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("a", 
"b", "c"), class = "factor"), diet = structure(c(1L, 1L, 1L, 
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("y", 
"z"), class = "factor"), id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), w = c(66L, 54L, 50L, 
74L, 59L, 53L, 67L, 75L, 66L, 64L, 73L, 56L, 53L, 74L, 54L, 63L, 
69L, 75L), h = c(152L, 190L, 174L, 176L, 185L, 186L, 180L, 194L, 
154L, 169L, 183L, 177L, 189L, 152L, 182L, 191L, 173L, 179L)), out.attrs = list(
    dim = c(city = 3L, diet = 2L, id = 3L), dimnames = list(city = c("city=a", 
    "city=b", "city=c"), diet = c("diet=y", "diet=z"), id = c("id=1", 
    "id=2", "id=3"))), row.names = c(NA, -18L), class = "data.frame")

d2 <- structure(list(city = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 
2L, 3L, 4L, 5L, 3L, 1L, 6L, 3L, 6L, 2L, 3L), .Label = c("a", 
"b", "c", "d", "e", "f"), class = "factor"), diet = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 
2L), .Label = c("y", "z"), class = "factor"), id = c(1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L
), w = c(66L, 54L, 50L, 74L, 59L, 53L, 67L, 75L, 66L, 64L, 73L, 
56L, 53L, 74L, 54L, 63L, 69L, 75L), h = c(152L, 190L, 174L, 176L, 
185L, 186L, 180L, 194L, 154L, 169L, 183L, 177L, 189L, 152L, 182L, 
191L, 173L, 179L)), out.attrs = list(dim = c(city = 3L, diet = 2L, 
id = 3L), dimnames = list(city = c("city=a", "city=b", "city=c"
), diet = c("diet=y", "diet=z"), id = c("id=1", "id=2", "id=3"
))), row.names = c(NA, -18L), class = "data.frame")
jay.sf
  • The last line of code could be written as `with(d, do.call(rbind, by(d, city:diet, regfun)))` – G. Grothendieck Mar 06 '21 at 12:49
  • @G.Grothendieck Great stuff thanks! Are there `:`-methods that differentiate between sequences and factors? – jay.sf Mar 06 '21 at 13:26
  • Thanks @jay.sf, so when I run this on my real data it works for d$city, but then when I run it for d$diet I get an error that says "Error in summary(lm(w ~ h : subscript out of bounds". My city column has like 6 factor levels whereas diet has like 20. Also, the last line throws the "subscript out of bounds" error as well. Any idea how to fix that? Thanks – user1992348 Mar 06 '21 at 21:15
  • @jay.sf I was able to partially reproduce the subscript out of bounds error that I have on my real data with your fake data. I just changed the city levels to this: `structure(list(city = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 3L, 1L, 6L, 3L, 6L, 2L, 3L), .Label = c("a","b", "c", "d", "e", "f")` and now when I run the last line for the city + diet combos I get the subscript out of bounds error. Do you know why that is and how to fix it? Thanks – user1992348 Mar 06 '21 at 21:59
  • @user1992348 I'm afraid I can't tell why "it works/it works not" with your real data. I believe I made a clear and small example that easily may be analyzed on how this kind of problem could be solved. Figure out what's the difference of the example data to your data. Try harder:) A problem is that you didn't make a [reproducible-example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – jay.sf Mar 06 '21 at 21:59
  • @jay.sf I think you responded right before I added another comment saying how you can reproduce what I saw with your data but changing the city variable. LMK if this is sufficient to be able to reproduce it. Thanks – user1992348 Mar 06 '21 at 22:03
  • @jay.sf it won't let me paste the full code here, but I only changed the city part as shown above – user1992348 Mar 06 '21 at 22:04
  • @user1992348 I think the problem is that your panel isn't balanced. Try `with(d, city:diet)` which gives all the interactions and compare with actual combinations in modified data of your comment above. – jay.sf Mar 06 '21 at 22:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229607/discussion-between-user1992348-and-jay-sf). – user1992348 Mar 06 '21 at 22:26
  • Please see edit to my answer @user1992348 and let me know if it helps. – jay.sf Mar 06 '21 at 22:35
  • @jay.sf Yes, this seems to resolve the issue of the unbalanced panel. I see now that it was failing because there wasn't data for some comparisons. This is a great solution. I had one last question. Is it possible to also include the distinct number of ids used in each regression? So for example, when we do city a and diet y how many ids (distinct people) are there in the data set, and also what is the mean weight for those people? I tried doing it, but now with the new solution for unbalanced panel data I'm really confused on how to extract this info... Thanks a lot – user1992348 Mar 06 '21 at 22:53
  • @user1992348 Usually we take up an id dummy in the model, something like `lapply(with(d2, split(d2, city:diet)), function(x) tryCatch(summary(lm(w ~ h + factor(id), x))$coe[, c(1, 4)], error=function(e) cbind.data.frame(Estimate=NA, `Pr(>|t|)`=NA)))`. Probably it makes more sense with your data. – jay.sf Mar 06 '21 at 23:21
1

To expand on my comments about the efficiency of the pooled analysis over that of the subgroup analyses...

Using starwars as (a less than ideal) starting point:

library(dplyr)  # starwars data, %>%, filter(), mutate()
library(tidyr)  # replace_na()

d <- starwars %>% 
       filter(mass < 1000) %>%   # Exclude Jabba
       mutate(maleOrNot=ifelse(sex=="male", sex, "other")) %>% 
       replace_na(list(maleOrNot="other"))

For the sake of argument, say we want to regress a character's mass based only on whether they are male or not and their height and then obtain the standard error of the predicted mass at the mean height.

pData <- d %>% 
           group_by(maleOrNot) %>% 
           summarise(height=mean(height), .groups="drop")
pData

# A tibble: 2 x 2
  maleOrNot height
* <chr>      <dbl>
1 male        178.
2 other       162.

By group analyses

lapply(
  d %>% pull(maleOrNot) %>% unique(),
  function(x) {
    m <- lm(mass ~ height, d %>% filter(maleOrNot == x))
    predict(m, pData %>% filter(maleOrNot ==  x), se.fit=TRUE)$se.fit
  }
)
[[1]]
[1] 2.656427

[[2]]
[1] 5.855176

Now the pooled analysis:

m <- lm(mass ~ maleOrNot + height, d)
predict(m, pData, se.fit=TRUE)$se.fit

       1        2 
2.789770 4.945734 

The pooled prediction for the males is slightly (5%) less precise, but for the non-males, precision is improved by 15.5%.

But the model isn't particularly good. Perhaps an interaction model will improve things:

m <- lm(mass ~ maleOrNot:height, d)
predict(m, pData, se.fit=TRUE)$se.fit
       1        2 
2.776478 4.880154 

Now the figures are 4.5% worse for the males and 16.7% better for the non-males. Including other terms in the model may well improve the precision even more.

In general terms (though there are exceptions), fitting a pooled model is unlikely to reduce precision compared to fitting several subgroup models, and can substantially improve it. This is because all groups contribute to the estimation of the (common) variance.
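One way to see this, as a minimal sketch (no output claimed): compare the residual standard error of the pooled fit, which is estimated from all observations, with the residual standard errors of the separate subgroup fits, each estimated from its own group only.

# Pooled sigma uses every observation; each subgroup sigma only its own rows
pooledSigma <- summary(lm(mass ~ maleOrNot + height, d))$sigma
groupSigma  <- sapply(unique(d$maleOrNot), function(x)
  summary(lm(mass ~ height, d %>% filter(maleOrNot == x)))$sigma)
pooledSigma
groupSigma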

In terms of computation time:

library(microbenchmark)

byGroup <- function() {
  lapply(
    d %>% pull(maleOrNot) %>% unique(),
    function(x) {
      m <- lm(mass ~ height, d %>% filter(maleOrNot == x))
      predict(m, pData %>% filter(maleOrNot ==  x), se.fit=TRUE)$se.fit
    }
  )
}

pooled <- function() {
  m <- lm(mass ~ maleOrNot + height, d)
  predict(m, pData, se.fit=TRUE)$se.fit
}

microbenchmark(byGroup, pooled, times=100)
Unit: nanoseconds
    expr min   lq  mean median uq  max neval
 byGroup  44 45.5 55.22     47 48  891   100
  pooled  42 44.0 60.27     46 47 1434   100

So for this simple case, there's virtually no difference. More complex examples may give different answers.

Limey