How to quickly create multiple summary tables with group_by() / summarise()?

Question

I have a data frame with N vars, M categorical and 2 numeric. I would like to create M data frames, one for each categorical variable.

Eg.,

data %>%
group_by(var1) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))

data %>%
group_by(varM) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))

etc...

Is there a way to iterate through the categorical variables and generate each of the summary tables? That is, without needing to repeat the above chunks M times.

Alternatively, these summary tables don't have to be individual objects, as long as I can easily reference / pull the summaries for each of the M variables.

Please share sample of your data using `dput()` (not `str` or `head` or picture/screenshot) so others can help. See more here https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example?rq=1 — Tung, Aug 13 '18 at 23:33

score 1 · Answer 1 · answered Aug 14 '18 at 10:12

You didn't supply a sample data.set so I created a small example to show how it works.

data <- data_frame(var1 = rep(letters[1:5], 2),
                   var2 = rep(LETTERS[11:15], 2),
                   var3 = 1:10,
                   var4 = 11:20)

A combination of tidyverse packages can get you where you need to be. Steps used: First we gather all the columns we want to group by on in a cols column and keep the numeric vars separate. Next we split the data.frame in a list of data.frames so that every column we want to group by on has it's own table with the 2 numeric vars. Now that everything is in a list, we need to use the map functionality from the purrr package. Using map, we spread the data.frame again so the column names are as we expect them to be. Finally using map we use group_by_if to group by on the character column and summarise the rest. All the outcomes are stored in a list where you can access what you need.

Run the code in pieces to see what every step does.

library(dplyr)
library(purrr)
library(tidyr)

outcomes <- data %>% 
  gather(cols, value, -c(var3, var4)) %>% 
  split(.$cols) %>%
  map(~ spread(.x, cols, value)) %>% 
  map(~ group_by_if(.x, is.character) %>% 
        summarise(sumvar3 = sum(var3),
                  meanvar4 = mean(var4)))

outcomes

$`var1`
# A tibble: 5 x 3
  var1  sumvar3 meanvar4
  <chr>   <int>    <dbl>
1 a           7     13.5
2 b           9     14.5
3 c          11     15.5
4 d          13     16.5
5 e          15     17.5

$var2
# A tibble: 5 x 3
  var2  sumvar3 meanvar4
  <chr>   <int>    <dbl>
1 K           7     13.5
2 L           9     14.5
3 M          11     15.5
4 N          13     16.5
5 O          15     17.5

I tried this solution, but I'm getting this error with L4: `Error: Duplicate identifiers for rows.` Maybe I'm not replacing something correctly—should the `.x` be another value? — Khashir, Aug 16 '18 at 20:16
@Khashir, in that case you need to supply a dput (of a part) of your data in your question. This error appears when spreading the data has duplicate combinations of values in the vars. — phiver, Aug 17 '18 at 12:17
Gotcha. I'll have to look at the data more carefully to see why this error comes up—off the top of my head, I don't see why there would be duplicates. I'm guessing these emerge in the intermediate steps? Anyway, I liked this answer more because it doesn't require writing a function (and is cleaner/more concise), but the other one worked regardless of duplicates. I'm guessing there's no easy way to bypass that limitation in spread? — Khashir, Aug 20 '18 at 18:15

score 1 · Accepted Answer · answered Aug 14 '18 at 11:22

Here is a solution (I hope). Creates a list of data frames with the formula you have:

library(tidyverse)

# Create sample data frame
data <- data.frame(var1 = sample(1:2, 5, replace = T),
                   var2 = sample(1:2, 5, replace = T),
                   var3 = sample(1:2, 5, replace = T),
                   varM = sample(1:2, 5, replace = T),
                   var5 = rnorm(5, 3, 6),
                   var6 = rnorm(5, 3, 6))

# Vars to be grouped (var1 until varM in this example)
vars_to_be_used <- names(select(data, var1:varM))

# Function to be used
group_fun <- function(x, .df = data) {
  .df %>%
      group_by_(.x) %>%
      summarise(sumVar5  = sum(var5),
                meanVar6 = mean(var6))
  }

# Loop over vars
results <- map(vars_to_be_used, group_fun)

# Nice list names
names(results) <- vars_to_be_used

print(results)

How to quickly create multiple summary tables with group_by() / summarise()?

2 Answers2