How to get stats of columns stratified by values of a categorical variable in R

Question

I want to get the mean() and sd() of each numeric column in the iris dataset, grouped by the value of the Species column:

> head(iris[order(runif(nrow(iris))), ])
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
50           5.0         3.3          1.4         0.2     setosa
111          6.5         3.2          5.1         2.0  virginica
69           6.2         2.2          4.5         1.5 versicolor
150          5.9         3.0          5.1         1.8  virginica

Without distinguishing among the 3 different species, apply would do the trick:

> stats = apply(iris[ ,1:4], MARGIN = 2, function(x) rbind(mean(x), SD = sd(x))); row.names(stats) = c("mean", "sd"); stats
     Sepal.Length Sepal.Width Petal.Length Petal.Width
mean    5.8433333   3.0573333     3.758000   1.1993333
sd      0.8280661   0.4358663     1.765298   0.7622377

But how can I get a list (?) with these results broken down by species?

@李哲源ZheyuanLi Thank you. Something like this works... `list(means = aggregate(. ~ Species, data = iris, FUN = "mean"), sd = aggregate(. ~ Species, data = iris, FUN = "sd"))`. I wonder if it could be made even more succinct... — Antoni Parellada, Jan 21 '17 at 00:11
Possible duplicate of [Apply function conditionally](http://stackoverflow.com/questions/16657512/apply-function-conditionally) — Barker, Jan 21 '17 at 00:55

score 3 · Accepted Answer · answered Jan 21 '17 at 00:47

aggregate is the function you are looking for:

> aggregate(. ~ Species, data = iris, FUN = mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
> aggregate(. ~ Species, data = iris, FUN = sd)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa    0.3524897   0.3790644    0.1736640   0.1053856
2 versicolor    0.5161711   0.3137983    0.4699110   0.1977527
3  virginica    0.6358796   0.3224966    0.5518947   0.2746501

aggregate splits a data set into groups defined by a factor (or a combination of factors) and applies a function to each group.
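On the follow-up in the comments about a more succinct form: as a sketch, a single aggregate call can return both statistics if FUN returns a named vector. The result then holds one two-column matrix per measurement rather than plain columns. The names `stats_by_species` and `flat` are illustrative and not part of the original answer.

```r
# One aggregate() call; FUN returns a named vector, so each measurement
# becomes a matrix column with "mean" and "sd" sub-columns.
stats_by_species <- aggregate(
  . ~ Species, data = iris,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)

# Optionally flatten the matrix columns into ordinary columns
# (Sepal.Length.mean, Sepal.Length.sd, ...).
flat <- do.call(data.frame, stats_by_species)
flat
```

Whether the nested or the flattened form is more convenient depends on how the statistics are used afterwards.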

score 1 · Answer 2 · answered Jan 20 '17 at 23:28

You can split the data by species with the split function to get a list of data frames, then apply a summary function to each one:

iris2 <- split(iris, iris$Species)
fun <- function(df){
  stats = apply(df[ ,1:4], MARGIN = 2, function(x) rbind(mean(x), SD = sd(x)))
  row.names(stats) = c("mean", "sd")
  return(stats)
}
lapply(iris2, fun)
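A more compact variant of the same split-and-apply idea, offered as a sketch rather than as the answerer's own code: splitting only the numeric columns and letting sapply name the rows avoids the helper function and the row.names step. The name `stats_list` is illustrative.

```r
# Split only the numeric columns by species, then summarise each piece.
# sapply() over a data frame iterates over its columns and binds the
# named vectors into a matrix with rows "mean" and "sd".
stats_list <- lapply(
  split(iris[1:4], iris$Species),
  function(d) sapply(d, function(x) c(mean = mean(x), sd = sd(x)))
)
stats_list$setosa
```

The result is a list with one 2 x 4 matrix per species, the same shape as the unstratified table in the question.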

ZsideZ


Thank you +1. I was hoping for an `apply`-family single call (perhaps with `sweep()`), but it may not be possible. – Antoni Parellada Jan 20 '17 at 23:31

score 1 · Answer 3 · answered Jan 21 '17 at 00:54

This is not a complete answer (it doesn't return a list and doesn't keep the same table structure). It is included for awareness of dplyr's very useful summarise_all:

library(dplyr)
df <- iris %>% group_by(Species) %>% summarise_all(funs(mean, sd)) 

# A tibble: 3 × 9
# Species Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean Sepal.Length_sd Sepal.Width_sd
# <fctr>             <dbl>            <dbl>             <dbl>            <dbl>           <dbl>          <dbl>
# 1     setosa             5.006            3.428             1.462            0.246       0.3524897      0.3790644
# 2 versicolor             5.936            2.770             4.260            1.326       0.5161711      0.3137983
# 3  virginica             6.588            2.974             5.552            2.026       0.6358796      0.3224966
# ... with 2 more variables: Petal.Length_sd <dbl>, Petal.Width_sd <dbl>
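Note that funs() has been deprecated since dplyr 0.8.0, and summarise_all() is superseded by across() in dplyr 1.0.0 and later. A sketch of the equivalent call, assuming a recent dplyr is installed (the name `stats_tbl` is illustrative):

```r
library(dplyr)

# across() applies each function in the named list to every non-grouping
# column; the default naming scheme yields Sepal.Length_mean,
# Sepal.Length_sd, and so on.
stats_tbl <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))
stats_tbl
```

The column order differs from the output above: across() groups the statistics by column rather than listing all means before all standard deviations.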

score 0 · Answer 4 · answered Jan 21 '17 at 02:20

Another option is data.table:

library(data.table)
as.data.table(iris)[,unlist(lapply(.SD, function(x)
    list(Mean = mean(x), SD = sd(x))), recursive = FALSE), Species]
#     Species Sepal.Length.Mean Sepal.Length.SD Sepal.Width.Mean Sepal.Width.SD Petal.Length.Mean Petal.Length.SD Petal.Width.Mean
#1:     setosa             5.006       0.3524897            3.428      0.3790644             1.462       0.1736640            0.246
#2: versicolor             5.936       0.5161711            2.770      0.3137983             4.260       0.4699110            1.326
#3:  virginica             6.588       0.6358796            2.974      0.3224966             5.552       0.5518947            2.026
#   Petal.Width.SD
#1:      0.1053856
#2:      0.1977527
#3:      0.2746501
