189

I'm struggling a bit with the dplyr-syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R.

df <- data.frame(
    a = sample(1:5, n, replace = TRUE), 
    b = sample(1:5, n, replace = TRUE), 
    c = sample(1:5, n, replace = TRUE), 
    d = sample(1:5, n, replace = TRUE), 
    grp = sample(1:3, n, replace = TRUE)
)
df %>% group_by(grp) %>% summarise(mean(a))

This gives me the mean for column "a" for each group indicated by "grp".

My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a)) for each column?

What I would like to have is something like

df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work
Artem Klevtsov
  • 9,193
  • 6
  • 52
  • 57
Daniel
  • 7,252
  • 6
  • 26
  • 38

5 Answers5

297

In dplyr (>=1.00) you may use across(everything() in summarise to apply a function to all variables:

library(dplyr)

df %>% group_by(grp) %>% summarise(across(everything(), list(mean)))
#> # A tibble: 3 x 5
#>     grp     a     b     c     d
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1  3.08  2.98  2.98  2.91
#> 2     2  3.03  3.04  2.97  2.87
#> 3     3  2.85  2.95  2.95  3.06

Alternatively, the purrrlyr package provides the same functionality:

library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#>     grp     a     b     c     d
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1  3.08  2.98  2.98  2.91
#> 2     2  3.03  3.04  2.97  2.87
#> 3     3  2.85  2.95  2.95  3.06

Also don't forget about data.table (use keyby to sort sort groups):

library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#>    grp        a        b        c        d
#> 1:   1 3.079412 2.979412 2.979412 2.914706
#> 2:   2 3.029126 3.038835 2.967638 2.873786
#> 3:   3 2.854701 2.948718 2.951567 3.062678

Let's try to compare performance.

library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
  a = sample(1:5, n, replace = TRUE), 
  b = sample(1:5, n, replace = TRUE), 
  c = sample(1:5, n, replace = TRUE), 
  d = sample(1:5, n, replace = TRUE), 
  grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
  dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
  purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
  data.table = dt[, lapply(.SD, mean), keyby = grp],
  check = FALSE
)
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr        2.81ms   2.85ms      328.        NA     17.3
#> 2 purrrlyr     7.96ms   8.04ms      123.        NA     24.5
#> 3 data.table 596.33µs 707.91µs     1409.        NA     10.3
Henrik
  • 65,555
  • 14
  • 143
  • 159
Artem Klevtsov
  • 9,193
  • 6
  • 52
  • 57
  • 1
    This is nice, but what should I do if I just want to apply function, i.e. `paste` to the last column, and for others columns I just want to take the first element or leave as-is? – biocyberman Aug 08 '15 at 17:22
  • 1
    I mean, the behavior like in `select` would be great: `summarize(df, a:c, d=paste(d, collaspe =',' )` . Just want to put more original columns in for reference – biocyberman Aug 08 '15 at 17:28
  • 1
    difference between purrr and dplyr pls ? – Mostafa90 Feb 22 '17 at 19:53
  • 1
    How can I add argument to mean function in that case?`mean(na.rm = TRUE)` into: `df %>% group_by(grp) %>% summarise_each(funs(mean))` – piotr Apr 25 '17 at 20:08
  • 9
    @piotr: `funs(mean(., na.rm = TRUE))` instead `funs(mean)`. – Artem Klevtsov May 14 '17 at 11:55
  • @ArtemKlevtsov In `purrr v0.2.4` both `slice_rows` and `dmap` have been removed. Do you know what are the alternatives to do the same thing in the newer release of `purrr`? I guess `purrr` isn't really necessary and the same can be done with- `iris %>% dplyr::group_by(Species) %>% dplyr::mutate_all(mean) %>% distinct()`, but just wanted to have knowledge of doing the same with `purrr`. – Indrajeet Patil Feb 11 '18 at 16:33
  • `dmap` moved to the `purrrlyr` package. – Artem Klevtsov Feb 12 '18 at 02:47
  • 1
    Instead of `summarise_all(funs(mean))`, I find the following syntax more intuitive to read: `summarise_all(.funs = mean)` – Agile Bean Apr 22 '19 at 07:56
  • 1
    @ArtemKlevtsov Dplyr performance has improved a lot. The current version, 0.8.3, performs at about 2.1 relative to data.table. (Contrast with about 7.0 relative to data.table in the table above.) – user2363777 Nov 28 '19 at 01:43
  • @ArtemKlevtsov is there a way to summarize multiple columns into rows just with the summary function being count (n that are positive) and proportion (%)? – sar Mar 28 '20 at 23:17
  • @ArtemKlevtsov, the summary values in your example aren't the same..? mean for column a, grp 3 is 4.00 in the first two examples and 3.714 in data.table? Was this a different iteration of `df`? – Mark Verhagen Jun 24 '20 at 21:36
  • @mark-verhagen, thank you for the note. Answer updated. – Artem Klevtsov Jun 25 '20 at 03:25
  • 1
    perhaps we should update for dplyr 1.0 with `across` replacement of the summarise suffixes – Pierre L Oct 28 '20 at 23:56
  • The ```list(mean)))``` should be just ```mean```. The code renamed my variables otherwise. – Tea Tree Dec 17 '20 at 01:43
  • Is it possible to apply different functions to differents sets of columns ? – Julien Sep 29 '22 at 09:26
67

We can summarize by using summarize_at, summarize_all and summarize_if on dplyr 0.7.4. We can set the multiple columns and functions by using vars and funs argument as below code. The left-hand side of funs formula is assigned to suffix of summarized vars. In the dplyr 0.7.4, summarise_each(and mutate_each) is already deprecated, so we cannot use these functions.

options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)

library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.4’

set.seed(123)
df <- data_frame(
  a = sample(1:5, 10, replace=T), 
  b = sample(1:5, 10, replace=T), 
  c = sample(1:5, 10, replace=T), 
  d = sample(1:5, 10, replace=T), 
  grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type
)

df %>% group_by(grp) %>% 
  summarise_each(.vars = letters[1:4],
                 .funs = c(mean="mean"))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over a selection of variables, use `summarise_at()`
# Error: Strings must match column names. Unknown columns: mean

You should change to the following code. The following codes all have the same result.

# summarise_at
df %>% group_by(grp) %>% 
  summarise_at(.vars = letters[1:4],
               .funs = c(mean="mean"))

df %>% group_by(grp) %>% 
  summarise_at(.vars = names(.)[1:4],
               .funs = c(mean="mean"))

df %>% group_by(grp) %>% 
  summarise_at(.vars = vars(a,b,c,d),
               .funs = c(mean="mean"))

# summarise_all
df %>% group_by(grp) %>% 
  summarise_all(.funs = c(mean="mean"))

# summarise_if
df %>% group_by(grp) %>% 
  summarise_if(.predicate = function(x) is.numeric(x),
               .funs = funs(mean="mean"))
# A tibble: 3 x 5
# grp a_mean b_mean c_mean d_mean
# <chr>  <dbl>  <dbl>  <dbl>  <dbl>
# 1     1   2.80   3.00    3.6   3.00
# 2     2   4.25   2.75    4.0   3.75
# 3     3   3.00   5.00    1.0   2.00

You can also have multiple functions.

df %>% group_by(grp) %>% 
  summarise_at(.vars = letters[1:2],
               .funs = c(Mean="mean", Sd="sd"))
# A tibble: 3 x 5
# grp a_Mean b_Mean      a_Sd     b_Sd
# <chr>  <dbl>  <dbl>     <dbl>    <dbl>
# 1     1   2.80   3.00 1.4832397 1.870829
# 2     2   4.25   2.75 0.9574271 1.258306
# 3     3   3.00   5.00        NA       NA
Keiku
  • 8,205
  • 4
  • 41
  • 44
  • Is it possible can i apply the each column with each function, i.e, for column `a` apply only the `mean` and for column `b` apply only the `sd ` with using the summaise_at – dondapati Jan 02 '18 at 08:08
  • 1
    @user7462639 In your case, you can use `summarise`. i.e, `summarise(a_mean = mean(a), b_sd = sd(b))` – Keiku Jan 03 '18 at 06:47
  • 4
    but what if i want to do mean for columns 1-13, sd for columns 14-30, sum for columns 31-100, and don't want to list them all out? – Arthur Yip Sep 05 '18 at 20:08
  • 2
    I upvoted your comment because I posted yesterday this question [R summarise_at dynamically by condition : mean for some columns, sum for others](https://stackoverflow.com/q/60306857/10489562). – phili_b Feb 20 '20 at 09:11
42

You can simply pass more arguments to summarise:

df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))

Source: local data frame [3 x 5]

  grp  mean(a)  mean(b)  mean(c) mean(d)
1   1 2.500000 3.500000 2.000000     3.0
2   2 3.800000 3.200000 3.200000     2.8
3   3 3.666667 3.333333 2.333333     3.0
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
Paul Hiemstra
  • 59,984
  • 12
  • 142
  • 149
  • 2
    Great! Is it even possible to do such things if column names and count are unknown? E.g. having 3 or 6 instead of 4 fixed columns? – Daniel Feb 08 '14 at 11:00
  • 4
    That is a TODO in `dplyr` I believe (like `plyr` `colwise`), see here for a rather awkward current solution: http://stackoverflow.com/a/21296364/1527403 – Stephen Henderson Feb 08 '14 at 11:55
  • Thanks a lot to both of you! I'll probably just use a loop to iterate all columns. – Daniel Feb 08 '14 at 16:09
  • 13
    `dplyr` now has `summarise_each` which will operate on each column – rrs Jun 18 '14 at 15:39
  • 1
    `dplyr` has now superseded `summarise_*` with `summarise(across(....` as of `dplyr` 1.0.0 – NelsonGon May 18 '20 at 04:03
7

For completeness: with dplyr v0.2 ddply with colwise will also do this:

> ddply(df, .(grp), colwise(mean))
  grp        a    b        c        d
1   1 4.333333 4.00 1.000000 2.000000
2   2 2.000000 2.75 2.750000 2.750000
3   3 3.000000 4.00 4.333333 3.666667

but it is slower, at least in this case:

> microbenchmark(ddply(df, .(grp), colwise(mean)), 
                  df %>% group_by(grp) %>% summarise_each(funs(mean)))
Unit: milliseconds
                                            expr      min       lq     mean
                ddply(df, .(grp), colwise(mean))     3.278002 3.331744 3.533835
 df %>% group_by(grp) %>% summarise_each(funs(mean)) 1.001789 1.031528 1.109337

   median       uq      max neval
 3.353633 3.378089 7.592209   100
 1.121954 1.133428 2.292216   100
Steven Matz
  • 729
  • 6
  • 4
5

All the examples are great, but I figure I'd add one more to show how working in a "tidy" format simplifies things. Right now the data frame is in "wide" format meaning the variables "a" through "d" are represented in columns. To get to a "tidy" (or long) format, you can use gather() from the tidyr package which shifts the variables in columns "a" through "d" into rows. Then you use the group_by() and summarize() functions to get the mean of each group. If you want to present the data in a wide format, just tack on an additional call to the spread() function.


library(tidyverse)

# Create reproducible df
set.seed(101)
df <- tibble(a   = sample(1:5, 10, replace=T), 
             b   = sample(1:5, 10, replace=T), 
             c   = sample(1:5, 10, replace=T), 
             d   = sample(1:5, 10, replace=T), 
             grp = sample(1:3, 10, replace=T))

# Convert to tidy format using gather
df %>%
    gather(key = variable, value = value, a:d) %>%
    group_by(grp, variable) %>%
    summarize(mean = mean(value)) %>%
    spread(variable, mean)
#> Source: local data frame [3 x 5]
#> Groups: grp [3]
#> 
#>     grp        a     b        c        d
#> * <int>    <dbl> <dbl>    <dbl>    <dbl>
#> 1     1 3.000000   3.5 3.250000 3.250000
#> 2     2 1.666667   4.0 4.666667 2.666667
#> 3     3 3.333333   3.0 2.333333 2.333333
Matt Dancho
  • 6,840
  • 3
  • 35
  • 26
  • That's another nice approach to keep in mind. Just one thing: I don't agree with Hadley's definition of tidy data always being in long format. Often, you don't want to multiply your observations, but want to have one row per observation. – Daniel Mar 07 '17 at 08:23
  • I don't disagree. Everyone has preferences and for some the wide approach is preferable either from a more intuitive perspective or because there are actually structural reasons you don't want long format. For me, my preference is long format because as I began working with `dplyr` more the long format makes things much easier. – Matt Dancho Mar 07 '17 at 15:59