How to calculate mean of all columns, by group?

Question

I need to get the mean of all columns of a large data set using R, grouped by 2 variables.

Lets try it with mtcars:

library(dplyr)
g_mtcars <- group_by(mtcars, cyl, gear)
summarise(g_mtcars, mean (hp))

# Source: local data frame [8 x 3]
# Groups: cyl [?]
# 
#     cyl  gear `mean(hp)`
#   <dbl> <dbl>      <dbl>
# 1     4     3    97.0000
# 2     4     4    76.0000
# 3     4     5   102.0000
# 4     6     3   107.5000
# 5     6     4   116.5000
# 6     6     5   175.0000
# 7     8     3   194.1667
# 8     8     5   299.5000

It works for "hp", but I need to get the mean for every other columns of mtcars (except "cyl" and "gear" that make a group). The data set is large, with several columns. Typing it by hand, like this: summarise(g_mtcars, mean (hp), mean(drat), mean (wt),...) is not practical.

PS: Using summarize is just my first attempt, and the only one I thought of. Please, be free to use any other R function that you want. — Miguel Rozsas, Dec 03 '16 at 11:39
We are supposed to guess which package contains `group_by`? Please make examples fully reproducible. — Bhas, Dec 03 '16 at 11:45
@Bhas No, we aren't. The function is from the `dplyr` package, and that should have been mentioned in the post. I edited it. — RHertel, Dec 03 '16 at 11:47

Wojciech Książek · Accepted Answer · 2020-08-05T12:14:12.383

Edit2: Recent version of dplyr suggests using regular summarise with across function, as in:

library(dplyr)
mtcars %>% 
group_by(cyl, gear) %>%
summarise(across(everything(), mean))

What you're looking for is either ?summarise_all or ?summarise_each from dplyr

Edit: full code:

library(dplyr)
mtcars %>% 
    group_by(cyl, gear) %>%
    summarise_all("mean")

# Source: local data frame [8 x 11]
# Groups: cyl [?]
# 
#     cyl  gear    mpg     disp       hp     drat       wt    qsec    vs    am     carb
#   <dbl> <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl> <dbl> <dbl>    <dbl>
# 1     4     3 21.500 120.1000  97.0000 3.700000 2.465000 20.0100   1.0  0.00 1.000000
# 2     4     4 26.925 102.6250  76.0000 4.110000 2.378125 19.6125   1.0  0.75 1.500000
# 3     4     5 28.200 107.7000 102.0000 4.100000 1.826500 16.8000   0.5  1.00 2.000000
# 4     6     3 19.750 241.5000 107.5000 2.920000 3.337500 19.8300   1.0  0.00 1.000000
# 5     6     4 19.750 163.8000 116.5000 3.910000 3.093750 17.6700   0.5  0.50 4.000000
# 6     6     5 19.700 145.0000 175.0000 3.620000 2.770000 15.5000   0.0  1.00 6.000000
# 7     8     3 15.050 357.6167 194.1667 3.120833 4.104083 17.1425   0.0  0.00 3.083333
# 8     8     5 15.400 326.0000 299.5000 3.880000 3.370000 14.5500   0.0  1.00 6.000000

Great ! I didn't know about summarise_all (and summarise_each) which are new to me. Thank you very much ! — Miguel Rozsas, Dec 03 '16 at 18:21
In your original answer and in "Edit2" how would you enter the ```na.rm = TRUE``` argument into the mean function. — Dylan Dijk, Dec 26 '20 at 19:57
you can pass a `purrr` style lambda as a function, i.e. `~ mean(., na.rm = TRUE))`, instead of just a function name — Wojciech Książek, Dec 27 '20 at 22:09

score 6 · Answer 2 · answered Dec 05 '18 at 20:06

aggregate is the easiest way to do this in base:

aggregate(. ~ cyl + gear, data = mtcars, FUN = mean)
#   cyl gear    mpg     disp       hp     drat       wt    qsec  vs   am     carb
# 1   4    3 21.500 120.1000  97.0000 3.700000 2.465000 20.0100 1.0 0.00 1.000000
# 2   6    3 19.750 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 1.000000
# 3   8    3 15.050 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3.083333
# 4   4    4 26.925 102.6250  76.0000 4.110000 2.378125 19.6125 1.0 0.75 1.500000
# 5   6    4 19.750 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4.000000
# 6   4    5 28.200 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 2.000000
# 7   6    5 19.700 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 6.000000
# 8   8    5 15.400 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 6.000000

score 4 · Answer 3 · edited Jan 21 '20 at 19:03

4

using data.table.(however you can't setDT(mtcars) because binding is locked. copy it to a different name like dt and try

 library(data.table)
 mt_dt = as.data.table(mtcars)
 mt_dt[ , lapply(.SD, mean) , by=c("cyl", "gear")]

edited Jan 21 '20 at 19:03

Gregor Thomas

136,190
20
167
294

answered Dec 03 '16 at 11:39

joel.wilson

8,243
5
28
48

Thank you ! It works for me. I will learn more about setDT which it is new for me. – Miguel Rozsas Dec 03 '16 at 18:20
basically, setDT helps to convert a data.frame to data.table – joel.wilson Dec 03 '16 at 18:59
Very clean and easy solution , however with load of columns some are factors in my case. Error in gmean(somefactorcolumname) : mean is not meaningful for factors. With `mt_dt[ , lapply(.SD, mean) , by=c("cyl", "gear"), .SDcols = sapply(mt_dt, is.numeric) ]` you are on the safe side – Birte Nov 12 '20 at 11:27

Maël · Answer 4 · 2023-05-05T14:25:01.630

With dplyr 1.1.0, you can use .by to do inline grouping:

summarise(mtcars, across(everything(), mean), .by = c(cyl, gear))

#   cyl gear    mpg     disp       hp     drat       wt    qsec  vs   am     carb
# 1   6    4 19.750 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4.000000
# 2   4    4 26.925 102.6250  76.0000 4.110000 2.378125 19.6125 1.0 0.75 1.500000
# 3   6    3 19.750 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 1.000000
# 4   8    3 15.050 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3.083333
# 5   4    3 21.500 120.1000  97.0000 3.700000 2.465000 20.0100 1.0 0.00 1.000000
# 6   4    5 28.200 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 2.000000
# 7   8    5 15.400 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 6.000000
# 8   6    5 19.700 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 6.000000

score 0 · Answer 5 · answered May 05 '23 at 14:25

Another possibility is to use collapse, which is noticeably fast:

library(collapse)
mtcars %>% 
  fgroup_by(cyl, gear) %>% 
  fmean()

#or, equivalently, fmean(mtcars, GRP(mtcars, ~ cyl + gear))

Benchmark: collapse is much faster.

microbenchmark::microbenchmark(
  collapse = mtcars %>% 
    fgroup_by(cyl, gear) %>% 
    fmean(),
  dplyr = summarise(mtcars, across(everything(), mean), .by = c(cyl, gear)),
  data.table = {mt_dt = as.data.table(mtcars)
  mt_dt[ , lapply(.SD, mean) , by=c("cyl", "gear")]},
  aggregate = aggregate(. ~ cyl + gear, data = mtcars, FUN = mean)
)

# Unit: microseconds
#        expr      min        lq     mean   median       uq       max neval
#    collapse   77.002  136.9505  150.770  152.051  168.401   311.001   100
#       dplyr 5588.001 6008.5515 6461.616 6257.501 6537.651 15146.801   100
#  data.table 3600.201 4404.0005 5284.708 5451.701 5784.752 12903.301   100
#   aggregate 2318.900 2581.5010 2812.408 2669.851 2778.851 10259.802   100

loki · Answer 6 · 2016-12-03T11:47:40.457

-1

You can use multiple mean statements in dplyr::summarize like this:

library(dplyr)

mtcars %>% 
  group_by(cyl, gear) %>% 
  summarize(mean_hp = mean(hp), mean_wt = mean(wt))

# Source: local data frame [8 x 4]
# Groups: cyl [?]

#     cyl  gear  mean_hp  mean_wt
#   <dbl> <dbl>    <dbl>    <dbl>
# 1     4     3  97.0000 2.465000
# 2     4     4  76.0000 2.378125
# 3     4     5 102.0000 1.826500
# 4     6     3 107.5000 3.337500
# 5     6     4 116.5000 3.093750
# 6     6     5 175.0000 2.770000
# 7     8     3 194.1667 4.104083
# 8     8     5 299.5000 3.370000

edited Dec 03 '16 at 11:47

answered Dec 03 '16 at 11:39

loki

9,816
7
56
82

Here we are supposed to guess which package defines `%>%`? Please make examples fully reproducible. – Bhas Dec 03 '16 at 11:46
2

Yes, I known I can, but the set is large and it has too many columns to type each of them. Thank you. – Miguel Rozsas Dec 03 '16 at 18:16
Quoting the question, *"Typing it by hand, like this: `summarise(g_mtcars, mean (hp), mean(drat), mean (wt),...)` is not practical.*" – Gregor Thomas Nov 16 '21 at 15:37

score -1 · Answer 7 · answered Dec 03 '16 at 15:21

-1

For the sake of completeness you could use package plyr and do this:

library(plyr)
ddply(mtcars,c('cyl','gear'), summarize,mean_hp=mean(hp))

answered Dec 03 '16 at 15:21

Bhas

1,844
1
11
9

1

But I still have the same problem. How to get mean for each variable of a large dataset. – Miguel Rozsas Dec 03 '16 at 18:18
This answers the [related FAQ How to calculate mean by group](https://stackoverflow.com/q/11562656/903061), but it ignores the "of all columns" in this question. – Gregor Thomas Nov 16 '21 at 15:36

How to calculate mean of all columns, by group?

7 Answers7

Linked

Related