
I'm trying to use the dplyr package to apply a function to all columns in a data.frame that are not being grouped, which I would do with aggregate():

aggregate(. ~ Species, data = iris, mean)

where mean is applied to all columns not used for grouping. (Yes, I know I can use aggregate, but I'm trying to understand dplyr.)

I can use summarize like this:

species <- group_by(iris, Species)
summarize(species,
          Sepal.Length = mean(Sepal.Length),
          Sepal.Width = mean(Sepal.Width))

But is there a way to have mean() applied to all columns that are not grouped, similar to the . ~ notation of aggregate()? I have a data.frame with 30 columns that I want to aggregate, so writing out the individual statements is not ideal.

kmm
  • See this previous **[SO Q/A](http://stackoverflow.com/questions/21295936/can-dplyr-summarise-over-several-variables-without-listing-each-one)**. – BrodieG Mar 25 '14 at 21:15

2 Answers


If you're willing to install a development version of dplyr, you can try the new (and still experimental) summarise_each():

devtools::install_github("hadley/dplyr", ref = "colwise")

library(dplyr)
iris %.%
  group_by(Species) %.%
  summarise_each(funs(mean))
## Source: local data frame [3 x 5]
## 
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

iris %.%
  group_by(Species) %.%
  summarise_each(funs(min, max))
## Source: local data frame [3 x 9]
## 
##      Species Sepal.Length_min Sepal.Width_min Petal.Length_min
## 1     setosa              4.3             2.3              1.0
## 2 versicolor              4.9             2.0              3.0
## 3  virginica              4.9             2.2              4.5
## Variables not shown: Petal.Width_min (dbl), Sepal.Length_max (dbl),
##   Sepal.Width_max (dbl), Petal.Length_max (dbl), Petal.Width_max (dbl)

Feedback much appreciated!

This will appear in dplyr 0.2.
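For readers on a later dplyr release, here is a minimal sketch of the same two summaries. It assumes dplyr 1.0.0 or newer, where `%>%` replaces `%.%` and `across()` handles the column-wise case; neither is part of the answer above, and the names `means` and `ranges` are illustrative only.

```r
library(dplyr)

# Mean of every non-grouping column, one row per species.
# Assumption: dplyr >= 1.0.0, which provides across().
means <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

# Several functions at once: a named list produces columns such as
# Sepal.Length_min and Sepal.Length_max.
ranges <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), list(min = min, max = max)))

print(means)
print(ranges)
```

Inside `summarise()`, `across(everything(), ...)` skips the grouping column, so `Species` does not need to be excluded by hand.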

hadley
  • Works perfectly for me. Even returns NaN for groups that are all missing data. – kmm Mar 25 '14 at 21:58
  • I can't replicate the error on the iris data set. But I have data where all(a$date == a$CALENDAR_YEAR_MONTH) is TRUE, yet group_by(a, date) %.% summarise_each(funs(median = median(.,na.rm=T), mean)) gives Error in `[.data.table`(dt, , list(median = median(CALENDAR_YEAR_MONTH, : Column 1 of result for group 4 is type 'integer' but expecting type 'double'. Column types must be consistent for each group. – xiaodai Aug 19 '14 at 05:14
  • FYI, `summarize_each()` has been deprecated in favor of `summarize_all()` – slizb May 24 '18 at 14:42

This will get you almost all the way in dplyr.

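library(dplyr)

# Average every numeric column within each species
# (function form of do(), pre-0.2 dplyr).
h <- iris %.%
  group_by(Species) %.%
  do(function(d) {
    sapply(Filter(is.numeric, d), mean)
  })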

as.data.frame(h)
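As a rough sketch of how the same idea could be written once `do()` takes an expression instead of a function, the version below assumes a later dplyr release in which `%>%` is the pipe and `.` inside `do()` refers to the rows of the current group. It is an illustrative rewrite and not the code from this answer.

```r
library(dplyr)

# Assumption: a later dplyr where do() evaluates an expression per group
# and `.` is that group's data frame (grouping column included).
h <- iris %>%
  group_by(Species) %>%
  # Keep numeric columns, average each, and return a one-row data frame
  # so that do() can bind the per-group results together.
  do(data.frame(t(sapply(Filter(is.numeric, .), mean))))

as.data.frame(h)
```

Returning a data frame from the unnamed `do()` expression is what lets dplyr attach the `Species` label to each row of the result.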
Ramnath
  • I wouldn't recommend using `do()` in that way, as it's likely to change in 0.2 – hadley Mar 25 '14 at 20:14
  • Is there an idiomatic way to do it in `dplyr`? In `data.table` I can do `data.table(iris)[,lapply(.SD, mean),Species]`. – Ramnath Mar 26 '14 at 01:48