Splitting by two categories in R

Question

n <- 3
strata <- rep(1:4, each=n)
y <- rnorm(n =12)
x <- 1:12
category <- rep(c("A", "B", "C"), times = 4)
df <- cbind.data.frame(y, x, strata, category)

I want to first split my data into a list by "strata", and then I want to again split all the data frames inside the new list by "category". And finally I want to regress y on x inside each of the resulting data frames (in this case each data frame would be one row but in the actual data there are different lengths of each strata and a different number of categories inside strata).

`split(df, df[,c("strata","category")])` does the literal split into (here) 12 frames within one list. The names are `.`-delimited concatenations of the two variables; this can be useful if you need to reference specific strata/category pairs — r2evans, Aug 27 '21 at 14:45
Similar posts: https://stackoverflow.com/q/49121135/5325862, https://stackoverflow.com/q/32481107/5325862, https://stackoverflow.com/q/60962181/5325862, https://stackoverflow.com/q/19603239/5325862, https://stackoverflow.com/q/1169539/5325862 — camille, Aug 27 '21 at 17:01

r2evans · Accepted Answer · 2021-08-31T13:18:23.353

The canonical way in R is to use split:

L <- split(df, df[,c("strata","category")])
L
# $`1.A`
#           y x strata category
# 1 -1.120867 1      1        A
# $`2.A`
#           y x strata category
# 4 -1.023001 4      2        A
# $`3.A`
#           y x strata category
# 7 0.5411806 7      3        A
# $`4.A`
#           y  x strata category
# 10 1.546789 10      4        A
# $`1.B`
#           y x strata category
# 2 0.6730641 2      1        B
# $`2.B`
#           y x strata category
# 5 -1.466816 5      2        B
# $`3.B`
#            y x strata category
# 8 -0.1955617 8      3        B
# $`4.B`
#            y  x strata category
# 11 -0.660904 11      4        B
# $`1.C`
#            y x strata category
# 3 -0.9880206 3      1        C
# $`2.C`
#           y x strata category
# 6 0.4111802 6      2        C
# $`3.C`
#             y x strata category
# 9 -0.03311637 9      3        C
# $`4.C`
#            y  x strata category
# 12 0.6799109 12      4        C

The names of the 12-element list (here) are the string-concatenation of the two categorical variables, .-delimited; this can easily be overridden (manually).

From here, to do regression on every element, you'd likely do something like:

models <- lapply(L, function(x) lm(..., data=x))

(or whichever regression tool you are planning to use).

You can do this in one step if you'd like,

results <- by(df, df[,c("strata","category")], function(x) lm(..., data=x))

The benefit is that it does it in one step. The by return can look a bit odd, but it is really just a list with some special print.by methods being used; you can still reference it just like a list as needed.

Another way to do this in dplyr:

library(dplyr)
results <- df %>%
  group_by(strata, category) %>%
  summarize(model = list(lm(y ~ x)))
results
# # A tibble: 12 x 3
# # Groups:   strata [4]
#    strata category model 
#     <int> <chr>    <list>
#  1      1 A        <lm>  
#  2      1 B        <lm>  
#  3      1 C        <lm>  
#  4      2 A        <lm>  
#  5      2 B        <lm>  
#  6      2 C        <lm>  
#  7      3 A        <lm>  
#  8      3 B        <lm>  
#  9      3 C        <lm>  
# 10      4 A        <lm>  
# 11      4 B        <lm>  
# 12      4 C        <lm>  
results$model[[1]]
# Call:
# lm(formula = y ~ x)
# Coefficients:
# (Intercept)            x  
#      -1.121           NA

As pointed out by Onyambu (thank you!), this works well (without data=) because we are explicitly listing the variables, and they will be found. If your regression uses ., for example, you may want to formalize it a little with

results <- df %>%
  group_by(strata, category) %>%
  summarize(model = list(lm(y ~ ., data = cur_data())))

y~x will work without it, but y~. will not, ergo data=cur_data().

This is really helpful, thank you! Follow up question: If I wanted to extract all the coefficients from "results" and put them in a data frame, what could I do? — Muhammad Kamil, Aug 27 '21 at 15:09
`pmap` is not the correct function to use here. THat is incorrect. That will iterate over `x` and `y`. You do not want the number of models to be equal to the number of rows of your dataset but rather you should have the number of models to be equal to the number of groups. What you need is `model = list(lm(y ~ x, cur_data()))` — Onyambu, Aug 27 '21 at 15:15
Thanks @Onyambu, I think it's fixed. (I don't know that `cur_data()` is strictly required, since the variables will be found in the environment just fine.) — r2evans, Aug 27 '21 at 15:32
@r2evans The notion of `cur_data` is to allow you use short hand formula. ie you could do `lm(y~., cur_data())` whenever there are many variables instead of listing all of them, but you cannot do `lm(y~.)`. Thats why i usually include it. — Onyambu, Aug 27 '21 at 17:40
Good point, I was going with the literal `y ~ x`. Your reasoning (`y ~ .`) makes perfect sense. Edited. — r2evans, Aug 27 '21 at 17:44

G. Grothendieck · Answer 2 · 2021-08-27T23:33:13.100

We have modified the input slightly to make the grouping columns factors and to provide more data per group in the Note at the end.

1) lmList Then we use lmList from nlme which can do regression by groups. We have used pool=FALSE but you can omit it and use the default of pool=TRUE depending on what you want. See ?lmList for details.

library(nlme)
fm <- lmList(y ~ x | g, transform(df, g = strata:category), pool = FALSE)
summary(fm)

giving:

Call:
  Model: y ~ x | g 
   Data: print(transform(df, g = strata:category)) 

Coefficients:
   (Intercept) 
      Estimate Std. Error    t value   Pr(>|t|)
1:A -0.6508622  0.7185126 -0.9058467 0.53142497
1:B -1.8043171  1.5367930 -1.1740794 0.44913412
2:A  4.0963651  1.4952687  2.7395512 0.22281400
2:B  3.5787230  0.1888514 18.9499422 0.03356368
   x 
       Estimate Std. Error     t value  Pr(>|t|)
1:A -0.03320257 0.21035895  -0.1578377 0.9003396
1:B  0.33526295 0.35569847   0.9425482 0.5188229
2:A -0.45361115 0.16347186  -2.7748577 0.2202010
2:B -0.35620163 0.01863826 -19.1113091 0.0332808

2) lm We can alternately formulate a single lm model with separate intercepts and slopes. This creates separate intercept and slope model matrix columns within each group. Look at model.matrix(y ~ (strata:category)/(x + 1) - 1, df) to get insight.

fm2 <- lm(y ~ (strata:category)/(x + 1) - 1, df)
summary(fm2)

giving:

Call:
lm(formula = y ~ (strata:category)/(x + 1) - 1, data = df)

Residuals:
       1        2        3        4        5        6        7        8 
 0.24290  0.41073 -0.48580 -0.82145  0.24290  0.41073  0.18876 -0.02152 
       9       10       11       12 
-0.37752  0.04304  0.18876 -0.02152 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
strata1:categoryA    -0.6509     0.7596  -0.857    0.440
strata2:categoryA     4.0964     2.0343   2.014    0.114
strata1:categoryB    -1.8043     0.9609  -1.878    0.134
strata2:categoryB     3.5787     2.2534   1.588    0.187
strata1:categoryA:x  -0.0332     0.2224  -0.149    0.889
strata2:categoryA:x  -0.4536     0.2224  -2.040    0.111
strata1:categoryB:x   0.3353     0.2224   1.507    0.206
strata2:categoryB:x  -0.3562     0.2224  -1.602    0.184

Residual standard error: 0.629 on 4 degrees of freedom
Multiple R-squared:  0.7886,    Adjusted R-squared:  0.3658 
F-statistic: 1.865 on 8 and 4 DF,  p-value: 0.2862

3) dplyr/broom

We can produce a list or tibble like this.

3a) group_map

library(broom)
library(dplyr)
df %>%
  group_by(strata, category) %>%
  group_map(~ tidy(lm(y ~ x, .)))

giving:

[[1]]
# A tibble: 2 x 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  -0.651      0.719    -0.906   0.531
2 x            -0.0332     0.210    -0.158   0.900

[[2]]
# A tibble: 2 x 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   -1.80      1.54     -1.17    0.449
2 x              0.335     0.356     0.943   0.519

[[3]]
# A tibble: 2 x 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    4.10      1.50       2.74   0.223
2 x             -0.454     0.163     -2.77   0.220

[[4]]
# A tibble: 2 x 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    3.58     0.189       18.9  0.0336
2 x             -0.356    0.0186     -19.1  0.0333

3b) group_modify

library(broom)
library(dplyr)

df %>%
  group_by(strata, category) %>%
  group_modify(~ tidy(lm(y ~ x, .))) %>%
  ungroup

giving:

# A tibble: 8 x 7
  strata category term        estimate std.error statistic p.value
  <fct>  <fct>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 1      A        (Intercept)  -0.651     0.719     -0.906  0.531 
2 1      A        x            -0.0332    0.210     -0.158  0.900 
3 1      B        (Intercept)  -1.80      1.54      -1.17   0.449 
4 1      B        x             0.335     0.356      0.943  0.519 
5 2      A        (Intercept)   4.10      1.50       2.74   0.223 
6 2      A        x            -0.454     0.163     -2.77   0.220 
7 2      B        (Intercept)   3.58      0.189     18.9    0.0336
8 2      B        x            -0.356     0.0186   -19.1    0.0333

4) listcompr Another approach is to use list comprehensions via the listcompr package. The first argument is a template for the component names, the second argument is the expression -- in this case the lm call, and the remaining arguments define the indexes s and c.

library(listcompr)
with(df, gen.named.list(str = "{s}.{c}", 
   expr = lm(y ~ x, subset = strata == s & category == c), 
   s = levels(strata), c = levels(category)))

giving this named list of lm objects:

$`1.A`

Call:
lm(formula = y ~ x, subset = strata == s & category == c)

Coefficients:
(Intercept)            x  
    -0.6509      -0.0332  


$`2.A`

Call:
lm(formula = y ~ x, subset = strata == s & category == c)

Coefficients:
(Intercept)            x  
     4.0964      -0.4536  


$`1.B`

Call:
lm(formula = y ~ x, subset = strata == s & category == c)

Coefficients:
(Intercept)            x  
    -1.8043       0.3353  


$`2.B`

Call:
lm(formula = y ~ x, subset = strata == s & category == c)

Coefficients:
(Intercept)            x  
     3.5787      -0.3562

If you dont mind, could you please elaborate abit on the 2nd option?. I would love to understand it abit — Onyambu, Aug 27 '21 at 17:44
Thanks. Just curious though, why decide to do `/(x+1)` why not just `x`. It seems both give the same model.matrix — Onyambu, Aug 27 '21 at 17:55
It makes it explicit that we want a slope and intercept but you could do it that way too if you prefer. — G. Grothendieck, Aug 27 '21 at 18:17

TarJae · Answer 3 · 2021-08-27T17:55:48.807

Update: Second part of the question: adding comment from Onyambu (many thanks):

If I understand you correctly you want to "regress y on x inside each of the resulting data frames" -> then:

With this code:

library(tidyverse)
library(broom)
df %>% 
    mutate(across(c(strata, category), as_factor)) %>% 
    group_by(category, strata) %>% 
    group_split() %>% 
    map_dfr(~tidy(lm(y ~ x, data = .)), .id = 'group')

We would get:

    group term        estimate std.error statistic p.value
   <chr> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 1     (Intercept)  -2.07         NaN       NaN     NaN
 2 1     x            NA             NA        NA      NA
 3 2     (Intercept)   0.0851       NaN       NaN     NaN
 4 2     x            NA             NA        NA      NA
 5 3     (Intercept)  -0.635        NaN       NaN     NaN
 6 3     x            NA             NA        NA      NA
 7 4     (Intercept)   0.948        NaN       NaN     NaN
 8 4     x            NA             NA        NA      NA
 9 5     (Intercept)   0.189        NaN       NaN     NaN
10 5     x            NA             NA        NA      NA
# ... with 14 more rows

1. part of the question:

We could use group_split function from dplyr:

it uses the grouping structure from group_by() and therefore is subject to the data mask
it does not name the elements of the list based on the grouping as this typically loses information and is confusing.

https://dplyr.tidyverse.org/reference/group_split.html

library(dplyr)
df %>% 
    group_by(strata, category) %>% 
    group_split()

Output:


>[12]>
[[1]]
# A tibble: 1 x 4
      y     x strata category
  <dbl> <int>  <int> <chr>   
1 0.198     1      1 A       

[[2]]
# A tibble: 1 x 4
       y     x strata category
   <dbl> <int>  <int> <chr>   
1 -0.575     2      1 B       

[[3]]
# A tibble: 1 x 4
       y     x strata category
   <dbl> <int>  <int> <chr>   
1 -0.583     3      1 C       

[[4]]
# A tibble: 1 x 4
       y     x strata category
   <dbl> <int>  <int> <chr>   
1 0.0641     4      2 A       

[[5]]
# A tibble: 1 x 4
      y     x strata category
  <dbl> <int>  <int> <chr>   
1 -1.41     5      2 B       

[[6]]
# A tibble: 1 x 4
      y     x strata category
  <dbl> <int>  <int> <chr>   
1 0.162     6      2 C       

[[7]]
# A tibble: 1 x 4
      y     x strata category
  <dbl> <int>  <int> <chr>   
1  1.18     7      3 A       

[[8]]
# A tibble: 1 x 4
      y     x strata category
  <dbl> <int>  <int> <chr>   
1 0.399     8      3 B       

[[9]]
# A tibble: 1 x 4
       y     x strata category
   <dbl> <int>  <int> <chr>   
1 -0.903     9      3 C       

[[10]]
# A tibble: 1 x 4
       y     x strata category
   <dbl> <int>  <int> <chr>   
1 -0.192    10      4 A       

[[11]]
# A tibble: 1 x 4
      y     x strata category
  <dbl> <int>  <int> <chr>   
1 -1.54    11      4 B       

[[12]]
# A tibble: 1 x 4
      y     x strata category
  <dbl> <int>  <int> <chr>   
1 -1.39    12      4 C

This is only half the question, since it doesn't address the regression, which is arguably more difficult than just splitting a data frame — camille, Aug 27 '21 at 16:53
@camille. Please see my update. The OP says: "And finally I want to regress y on x inside each of the resulting data frames (in this case each data frame would be one row but in the actual data there are different lengths of each strata and a different number of categories inside strata)." — TarJae, Aug 27 '21 at 17:14
You should consider adding the group otherwise one would not know the group unto which the coefficients belong to. ie consider doing `%>%map_dfr(~tidy(lm(y ~ x, data = .)), .id = 'group')` — Onyambu, Aug 27 '21 at 17:49
Thank you very much Onyambu for your time. I really appreciate. — TarJae, Aug 27 '21 at 17:54

Splitting by two categories in R

3 Answers3

Linked