4

I am trying to find a way to get summary stats such as means by group and overall in one step using dplyr

#Data set-up
sex <- sample(c("M", "F"), size=100, replace=TRUE)
age <- rnorm(n=100, mean=20 + 4*(sex=="F"), sd=0.1)
dsn <- data.frame(sex, age)


library("tidyverse")

#Using dplyr to get means by group and overall
mean_by_sex <- dsn %>% 
  group_by(sex) %>% 
  summarise(mean_age = mean(age))

mean_all <- dsn %>% 
  summarise(mean_age = mean(age)) %>% 
  add_column(sex = "All")

#combining the results by groups and overall
final_result <- rbind(mean_by_sex, mean_all)
final_result  
#> # A tibble: 3 x 2
#>   sex   mean_age
#>   <fct>    <dbl>
#> 1 F         24.0
#> 2 M         20.0
#> 3 All       21.9
#This is the table I want but I wonder if is the only way to do this

Is there a way this in shorter step using group_by_at or group_by_all or a similar functions using tidyverse and dplyr Any help would be greatly appreciated

SimRock
  • 229
  • 3
  • 10
  • 1
    Similar to: https://stackoverflow.com/questions/31164350/dplyr-summarize-with-subtotals (which doesn't really have a better answer than what you are using) or another alternative: https://stackoverflow.com/questions/34045447/how-to-analyse-a-data-set-both-grouped-by-and-ungrouped-in-one-analysis-using-dp – MrFlick Feb 27 '20 at 16:38
  • The https://cran.r-project.org/package=tables package can do that in one line: `library(tables); tabular(sex + 1 ~ age * mean, dsn)` – G. Grothendieck Feb 27 '20 at 19:57

3 Answers3

3

One option could perhaps be:

dsn %>%
 group_by(sex) %>%
 summarise(mean_age = mean(age)) %>%
 add_row(sex = "ALL", mean_age = mean(dsn$age))

  sex   mean_age
  <fct>    <dbl>
1 F         24.0
2 M         20.0
3 ALL       21.9
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
2

A little switching around can do it, too.

final_result <- dsn %>% 
  add_row(sex = "All", age = mean(age)) %>% 
  group_by(sex) %>% 
  summarise(mean_age = mean(age))
vanao veneri
  • 970
  • 2
  • 12
  • 31
  • We posted almost the same thing at the same :) However, there is a substantial difference: your solution won't work if age is not in the global environment. – tmfmnk Feb 27 '20 at 16:48
  • Thank you @vanao veneri. I slightly modified the code and it worked – SimRock Feb 27 '20 at 19:43
0

These answers are great if you have one variable to summarize by. What about two? I want to summarize across one but leave the other as is. The above solutions do not work in this case because the data frame still needs to be grouped.

#Data set up 
set.seed(3243242)
dsn <- tibble(
  obese = sample(c(TRUE, FALSE), size=100, replace = TRUE),
  sex = sample(c("M", "F"), size=100, replace=TRUE),
                  age = rnorm(n=100, mean=20 + 4*(sex=="F"), sd=0.1)
                    )
library("tidyverse")

I restated the original problem using 2 group_by variables.

#Extend to 2 group_by variables?
df1 <- dsn %>%
  group_by(sex, obese) %>% 
  summarise(mean_age = mean(age)) %>%
  ungroup() 

#Also across sex
df2 <- dsn %>%
  group_by(obese) %>% 
  summarise(mean_age = mean(age)) %>%
  ungroup() 

#Final_result:
bind_rows(df1, df2)

Way to do this in one step? You can add mean with add_row() but not with a grouped df. Another option is to create a function that does all the things on the group dataset. If there are other things you want to do, say sort or create new variables, you can do it in the function. Then, you can apply the function to each grouped dataset. After combining via dplyr::bind_rows(), you can change the missing group variable to all via tidyr::replace_na().

  #'@param df_group A grouped tibble
find_summary <- function(df_group){
  df_group %>% 
summarize(mean_age = mean(age))  #add other dplyr verbs here as needed like arrange or mutate
}

bind_rows(
    find_summary(group_by(dsn, sex, obese)),
    find_summary(group_by(dsn, obese))
    ) %>%
     replace_na(list(sex = "all"))
sex   obese mean_age
  <chr> <lgl>    <dbl>
1 F     FALSE     24.0
2 F     TRUE      24.0
3 M     FALSE     20.0
4 M     TRUE      20.0
5 all   FALSE     21.7
6 all   TRUE      22.3

You can extend the idea if you want a summary of all variables, by one variable, and by two variables.

bind_rows(
    find_summary(group_by(dsn, sex, obese)),
    find_summary(group_by(dsn, obese)),
    find_summary(dsn)
    ) %>%
     replace_na(list(sex = "all", obese = "all"))
  sex   obese mean_age
  <chr> <chr>    <dbl>
1 F     FALSE     24.0
2 F     TRUE      24.0
3 M     FALSE     20.0
4 M     TRUE      20.0
5 all   FALSE     21.7
6 all   TRUE      22.3
7 all   all       22.0
phargart
  • 655
  • 3
  • 14