Performing multiple different aggregations on same data in dplyr

Question

I see this come up for Python, pandas, and other tools, but I can't find any help for doing it with dplyr.

library(dplyr)  # also re-exports tibble() and %>%
set.seed(1234L)
i <- 1:80 # years 1 thru 80

S1 <- .02*i  # a trend line
S2 <- S1[i] + rnorm(n=80,mean=0,sd=.2) # same line w/ error added

x <- tibble(S1,
            S2, 
            yrs10 = rep(1:8,each=10),   # to aggregate by decades (8 groups)
            yrs16 = rep(1:5,each = 16), # by 16 yr spans (5 groups)
            yrs20 = rep(1:4,each = 20), # by 20 yr spans (4 groups)
            yrs40 = rep(1:2,each = 40)  # by 40 yr spans (2 groups)
            )

This is my problem: I cannot find a way to feed repeated aggregations into the group_by statement without declaring 4 different vars and repeating all the code.

x10 <- x %>%
     group_by(yrs10) %>%
     summarize(cor=cor(S1,S2)^2) # squared to show prop of var accounted for

x16 <- x %>%
     group_by(yrs16) %>%
     summarize(cor=cor(S1,S2)^2) # squared to show prop of var accounted for

x20 <- x %>%
     group_by(yrs20) %>%
     summarize(cor=cor(S1,S2)^2) # squared to show prop of var accounted for

x40 <- x %>%
     group_by(yrs40) %>%
     summarize(cor=cor(S1,S2)^2) # squared to show prop of var accounted for

x10
x16
x20
x40

There should be a way to feed yrs10 thru yrs40 as a list and collect these results as a list but all I get are errors when I try. The dplyr answers I see here don't seem to cover this scenario. What am I missing?

langtang · Answer 1 · 2022-05-09T17:54:49.410

You can pivot the grouping columns into long format, then summarize once per combination of grouping column and group value:

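library(tidyr)  # pivot_longer() and pivot_wider() come from tidyr, not dplyr

x %>% pivot_longer(cols = starts_with("yrs")) %>% 
  group_by(name, value) %>% 
  summarize(cor = cor(S1, S2)^2, .groups = "drop")  # drop grouping after summarizing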

Output:

   name  value     cor
   <chr> <int>   <dbl>
 1 yrs10     1 0.0378 
 2 yrs10     2 0.364  
 3 yrs10     3 0.0371 
 4 yrs10     4 0.00845
 5 yrs10     5 0.00178
 6 yrs10     6 0.420  
 7 yrs10     7 0.0386 
 8 yrs10     8 0.0281 
 9 yrs16     1 0.294  
10 yrs16     2 0.181  
11 yrs16     3 0.317  
12 yrs16     4 0.467  
13 yrs16     5 0.0289 
14 yrs20     1 0.386  
15 yrs20     2 0.135  
16 yrs20     3 0.403  
17 yrs20     4 0.0556 
18 yrs40     1 0.554  
19 yrs40     2 0.660

If you want the result back in wide format, add:

... %>%
pivot_wider(id_cols=value, names_from=name, values_from = cor)

Output:

  value   yrs10   yrs16   yrs20  yrs40
  <int>   <dbl>   <dbl>   <dbl>  <dbl>
1     1 0.0378   0.294   0.386   0.554
2     2 0.364    0.181   0.135   0.660
3     3 0.0371   0.317   0.403  NA    
4     4 0.00845  0.467   0.0556 NA    
5     5 0.00178  0.0289 NA      NA    
6     6 0.420   NA      NA      NA    
7     7 0.0386  NA      NA      NA    
8     8 0.0281  NA      NA      NA
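
To see why pivoting works, it may help to look at the long table before any grouping happens. The sketch below rebuilds the question's data and inspects the result of pivot_longer(); the name `long` is only an illustrative variable, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Rebuild the data from the question
set.seed(1234L)
i  <- 1:80
S1 <- 0.02 * i
S2 <- S1 + rnorm(n = 80, mean = 0, sd = 0.2)
x  <- tibble(S1, S2,
             yrs10 = rep(1:8, each = 10),
             yrs16 = rep(1:5, each = 16),
             yrs20 = rep(1:4, each = 20),
             yrs40 = rep(1:2, each = 40))

# Each original row is repeated once per grouping column;
# `name` holds the column name and `value` holds the group id.
long <- x %>% pivot_longer(cols = starts_with("yrs"))

dim(long)          # 320 rows (80 rows x 4 grouping columns), 4 columns
count(long, name)  # 80 rows for each of yrs10, yrs16, yrs20, yrs40
```

Since every (S1, S2) pair now appears once under each grouping scheme, a single group_by(name, value) covers all four aggregations.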

Finally, if you want each result as a separate list element, you could wrap the summary in a function and apply it over the grouping column names:

f <- function(d,g) {
  d %>% group_by(across(all_of(g))) %>% summarize(cor=cor(S1,S2)^2)
}
lapply(colnames(x)[3:6], function(n) f(x,n))

Output:

[[1]]
# A tibble: 8 x 2
  yrs10     cor
  <int>   <dbl>
1     1 0.0378 
2     2 0.364  
3     3 0.0371 
4     4 0.00845
5     5 0.00178
6     6 0.420  
7     7 0.0386 
8     8 0.0281 

[[2]]
# A tibble: 5 x 2
  yrs16    cor
  <int>  <dbl>
1     1 0.294 
2     2 0.181 
3     3 0.317 
4     4 0.467 
5     5 0.0289

[[3]]
# A tibble: 4 x 2
  yrs20    cor
  <int>  <dbl>
1     1 0.386 
2     2 0.135 
3     3 0.403 
4     4 0.0556

[[4]]
# A tibble: 2 x 2
  yrs40   cor
  <int> <dbl>
1     1 0.554
2     2 0.660
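
The list above is unnamed, so results have to be picked out by position. A small variation, assuming you would rather look them up by grouping column, is to pass a named vector to lapply(). The names `grp_cols` and `res` are illustrative only.

```r
library(dplyr)

# Rebuild the data from the question
set.seed(1234L)
i  <- 1:80
S1 <- 0.02 * i
S2 <- S1 + rnorm(n = 80, mean = 0, sd = 0.2)
x  <- tibble(S1, S2,
             yrs10 = rep(1:8, each = 10),
             yrs16 = rep(1:5, each = 16),
             yrs20 = rep(1:4, each = 20),
             yrs40 = rep(1:2, each = 40))

# Same helper as in the answer: group by the column named in `g`
f <- function(d, g) {
  d %>% group_by(across(all_of(g))) %>% summarize(cor = cor(S1, S2)^2)
}

grp_cols <- c("yrs10", "yrs16", "yrs20", "yrs40")

# lapply() keeps the names of its input, so naming the vector names the result
res <- lapply(setNames(grp_cols, grp_cols), function(n) f(x, n))

names(res)   # the four grouping column names
res$yrs40    # the 2-row summary for 40-year spans
```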

Thanks. Especially the pivot_wider call you added. I had not considered pivoting. I usually go the various x_apply routes and was trying to feed the various columns into a single function which ran into all sorts of issues. — John Garland, May 09 '22 at 17:49
@JohnGarland, you can also do this in functional form, as in my edit, or with `map()`, as Andrew has nicely shown below — langtang, May 09 '22 at 17:53

AndrewGB · Answer 2 · 2022-05-09T18:06:43.417

Another option, if you want a list back, is to use map() and pass each column name into group_by() inside the mapped function.

library(tidyverse)

map(names(x[-c(1:2)]), ~
  x %>%
    group_by(.data[[.x]]) %>%
    summarize(cor = cor(S1, S2) ^ 2))

Output

[[1]]
# A tibble: 8 × 2
  yrs10     cor
  <int>   <dbl>
1     1 0.0378 
2     2 0.364  
3     3 0.0371 
4     4 0.00845
5     5 0.00178
6     6 0.420  
7     7 0.0386 
8     8 0.0281 

[[2]]
# A tibble: 5 × 2
  yrs16    cor
  <int>  <dbl>
1     1 0.294 
2     2 0.181 
3     3 0.317 
4     4 0.467 
5     5 0.0289

[[3]]
# A tibble: 4 × 2
  yrs20    cor
  <int>  <dbl>
1     1 0.386 
2     2 0.135 
3     3 0.403 
4     4 0.0556

[[4]]
# A tibble: 2 × 2
  yrs40   cor
  <int> <dbl>
1     1 0.554
2     2 0.660

If you need the results in a single data frame instead, adjust the function passed to map() so that every piece has the same column names, then bind the pieces together:

map(names(x[-c(1:2)]), ~
  x %>%
    group_by(.data[[.x]]) %>%
    summarize(cor = cor(S1, S2) ^ 2) %>%
    mutate(name = colnames(.)[1]) %>%
    rename("value" = 1)) %>%
  bind_rows()

Output

   value     cor name 
   <int>   <dbl> <chr>
 1     1 0.0378  yrs10
 2     2 0.364   yrs10
 3     3 0.0371  yrs10
 4     4 0.00845 yrs10
 5     5 0.00178 yrs10
 6     6 0.420   yrs10
 7     7 0.0386  yrs10
 8     8 0.0281  yrs10
 9     1 0.294   yrs16
10     2 0.181   yrs16
11     3 0.317   yrs16
12     4 0.467   yrs16
13     5 0.0289  yrs16
14     1 0.386   yrs20
15     2 0.135   yrs20
16     3 0.403   yrs20
17     4 0.0556  yrs20
18     1 0.554   yrs40
19     2 0.660   yrs40
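
A possible variation on the same idea, offered as a sketch rather than a replacement: name the input vector and let bind_rows() record which grouping column each piece came from via its .id argument. The names `grp_cols` and `combined` are illustrative, and the column order differs from the output above (`name` comes first).

```r
library(dplyr)
library(purrr)

# Rebuild the data from the question
set.seed(1234L)
i  <- 1:80
S1 <- 0.02 * i
S2 <- S1 + rnorm(n = 80, mean = 0, sd = 0.2)
x  <- tibble(S1, S2,
             yrs10 = rep(1:8, each = 10),
             yrs16 = rep(1:5, each = 16),
             yrs20 = rep(1:4, each = 20),
             yrs40 = rep(1:2, each = 40))

grp_cols <- c("yrs10", "yrs16", "yrs20", "yrs40")

combined <- grp_cols %>%
  set_names() %>%                      # name each element after itself
  map(function(g) {
    x %>%
      group_by(value = .data[[g]]) %>% # group by column `g`, stored as `value`
      summarize(cor = cor(S1, S2)^2)
  }) %>%
  bind_rows(.id = "name")              # list names become the `name` column

combined
```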

I tried that map() route for hours, but could never get the call to parse correctly. The ".data[[.x]]" call is what I could not get correct. Still not sure how that works, but it clearly does. I'll check further documentation. — John Garland, May 09 '22 at 18:03
@JohnGarland The `.data` pronoun essentially makes it explicit that `.x` is a column name instead of just a character string. See `?rlang::.data` for more info. — AndrewGB, May 09 '22 at 18:08
I had gotten stuck on this too last year when I was trying to pass column names through a function, and had tried everything but `.data`. Here's another instance of it being used: https://stackoverflow.com/a/68738067/15293191 — AndrewGB, May 09 '22 at 18:11
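
For anyone else puzzled by `.data[[.x]]`, here is a toy comparison on made-up data (`df`, `col`, and the column names are hypothetical, not from the question). It contrasts passing a variable that holds a string directly to group_by() with looking the column up through the `.data` pronoun.

```r
library(dplyr)

df  <- tibble(g = c("a", "a", "b"), v = c(1, 2, 10))
col <- "g"  # the column name, stored as a string

# `col` is not a column of df, so group_by() evaluates it to the constant "g"
# and every row falls into one group.
by_string <- df %>% group_by(col) %>% summarize(total = sum(v))

# .data[[col]] looks up the column whose name is stored in `col`,
# so rows are grouped by the values of g.
by_pronoun <- df %>% group_by(.data[[col]]) %>% summarize(total = sum(v))

nrow(by_string)   # 1
nrow(by_pronoun)  # 2
by_pronoun$total  # 3 10
```

`group_by(across(all_of(col)))`, as used in the first answer, should give the same grouping as the `.data` version here.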
