How to get summary statistics by group

Question

I'm trying to get multiple summary statistics in R/S-PLUS grouped by categorical column in one shot. I found couple of functions, but all of them do one statistic per call, like aggregate().

data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66, 
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4,6,6,8)))
df <- data.frame(group=grp, dt=data)
mg <- aggregate(df$dt, by=df$group, FUN=mean)    
mg <- aggregate(df$dt, by=df$group, FUN=sum)

What I'm looking for is to get multiple statistics for the same group like mean, min, max, std, ...etc in one call, is that doable?

This one is a pretty basic question with multiple answers. You may not be familiar with RSeek [(LINK)](http://www.rseek.org/) and the sos library [(LINK)](http://cran.r-project.org/web/packages/sos/index.html) Both are great resources to help you figure out the answers to questions. Ibet with those resources you'll be able to answer your own question in seconds. — Tyler Rinker, Mar 23 '12 at 22:13
I just found a wonderful R package [tables](http://cran.r-project.org/web/packages/tables/). You can tabulate data by as many categories as you desire and calculate multiple statistics for multiple variables - it truly is amazing! But wait, there's more! The package has functions to generate LaTeX code for your tables for easy import to your documents. — StatGrrl, Aug 09 '14 at 21:05

BenBarnes · Answer 1 · 2021-04-19T09:52:02.423

1. `tapply`

I'll put in my two cents for tapply().

tapply(df$dt, df$group, summary)

You could write a custom function with the specific statistics you want or format the results:

tapply(df$dt, df$group,
  function(x) format(summary(x), scientific = TRUE))
$A
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"5.900e+01" "5.975e+01" "6.100e+01" "6.100e+01" "6.225e+01" "6.300e+01" 

$B
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"6.300e+01" "6.425e+01" "6.550e+01" "6.600e+01" "6.675e+01" "7.100e+01" 

$C
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"6.600e+01" "6.725e+01" "6.800e+01" "6.800e+01" "6.800e+01" "7.100e+01" 

$D
       Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
"5.600e+01" "5.975e+01" "6.150e+01" "6.100e+01" "6.300e+01" "6.400e+01"

2. `data.table`

The data.table package offers a lot of helpful and fast tools for these types of operation:

library(data.table)
setDT(df)
> df[, as.list(summary(dt)), by = group]
   group Min. 1st Qu. Median Mean 3rd Qu. Max.
1:     A   59   59.75   61.0   61   62.25   63
2:     B   63   64.25   65.5   66   66.75   71
3:     C   66   67.25   68.0   68   68.00   71
4:     D   56   59.75   61.5   61   63.00   64

@maximusyoda, to get scientific notation, use a custom function instead of `summary` such as: `tapply(df$dt, df$group, function(x) format(summary(x), scientific = TRUE))` — BenBarnes, Oct 25 '14 at 19:21
@JorgeParedes, do you mean the list of summary statistics? I use the `data.table` package for these kinds of operations usually. I'll update the answer with an example. — BenBarnes, Apr 19 '21 at 09:40

Jot eN · Answer 2 · 2019-07-03T15:20:50.037

65

dplyr package could be nice alternative to this problem:

library(dplyr)

df %>% 
  group_by(group) %>% 
  summarize(mean = mean(dt),
            sum = sum(dt))

To get 1st quadrant and 3rd quadrant

df %>% 
  group_by(group) %>% 
  summarize(q1 = quantile(dt, 0.25),
            q3 = quantile(dt, 0.75))

edited Jul 03 '19 at 15:20

answered Nov 10 '14 at 10:59

Jot eN

6,120
4
40
59

conor · Answer 3 · 2017-09-08T14:35:38.453

44

Using Hadley Wickham's purrr package this is quite simple. Use split to split the passed data_frame into groups, then use map to apply the summary function to each group.

library(purrr)

df %>% split(.$group) %>% map(summary)

edited Sep 08 '17 at 14:35

answered Aug 12 '16 at 14:52

conor

1,267
10
7

2

df %>% group_by(group) %>% do(data.frame(summary(.))) should do something similar in dplyr – AlexR Aug 12 '16 at 15:12
1

This seems to produce identical output as the `tapply` approach using base R. – dsaxton Sep 16 '16 at 15:32

score 21 · Answer 4 · 2013-08-26T15:47:22.060

21

There's many different ways to go about this, but I'm partial to describeBy in the psych package:

describeBy(df$dt, df$group, mat = TRUE)

edited Aug 26 '13 at 15:47

answered Mar 24 '12 at 05:46

score 12 · Answer 5 · answered Mar 23 '12 at 22:13

12

take a look at the plyr package. Specifically, ddply

ddply(df, .(group), summarise, mean=mean(dt), sum=sum(dt))

answered Mar 23 '12 at 22:13

Justin

42,475
9
93
111

score 11 · Answer 6 · answered Jan 23 '17 at 16:53

after 5 long years I'm sure not much attention is going to be received for this answer, But still to make all options complete, here is the one with data.table

library(data.table)
setDT(df)[ , list(mean_gr = mean(dt), sum_gr = sum(dt)) , by = .(group)]
#   group mean_gr sum_gr
#1:     A      61    244
#2:     B      66    396
#3:     C      68    408
#4:     D      61    488

score 8 · Answer 7 · edited Sep 15 '20 at 09:38

8

The psych package has a great option for grouped summary stats:

library(psych)
    
describeBy(dt, group="grp")

produces lots of useful stats including mean, median, range, sd, se.

edited Sep 15 '20 at 09:38

Gerhard

6,850
8
51
81

answered Mar 09 '20 at 10:50

FGP

97
1
7

score 7 · Answer 8 · answered Apr 22 '19 at 12:18

7

While some of the other approaches work, this is pretty close to what you were doing and only uses base r. If you know the aggregate command this may be more intuitive.

with( df , aggregate( dt , by=list(group) , FUN=summary)  )

answered Apr 22 '19 at 12:18

MatthewR

2,660
5
26
37

1

shout out to this one for using base R, returning a data.frame, and using the summary function so I don't need to write one. – user3055034 Feb 08 '21 at 05:59
Careful: it does not return a data.frame (each column in the resulting summary visualization is not a data.frame name). It's a nice, efficient, clever solution. – Matteo Castagna May 11 '22 at 09:51

score 6 · Answer 9 · answered Dec 26 '13 at 05:04

6

Besides describeBy, the doBy package is an another option. It provides much of the functionality of SAS PROC SUMMARY. Details: http://www.statmethods.net/stats/descriptives.html

answered Dec 26 '13 at 05:04

dwstu

839
10
12

Another quick way to tabulate data (without descriptive stats) is to use `freq` function in the `descr` package. That is not strictly what you asked for, but may still be instructive. Details: http://www.rdocumentation.org/packages/descr/functions/freq – dwstu Dec 26 '13 at 05:14

MS Berends · Answer 10 · 2021-05-11T07:11:55.030

Not sure why the popular skimr package hasn’t been brought up. Their function skim() was meant to replace the base R summary() and supports dplyr grouping:

library(dplyr)
library(skimr)

starwars %>%
  group_by(gender) %>%
  skim()

#> ── Data Summary ────────────────────────
#>                            Values    
#> Name                       Piped data
#> Number of rows             87        
#> Number of columns          14        
#> _______________________              
#> Column type frequency:               
#>   character                7         
#>   list                     3         
#>   numeric                  3         
#> ________________________             
#> Group variables            gender    
#> 
#> ── Variable type: character ──────────────────────────────────────────────────────
#>    skim_variable gender    n_missing complete_rate   min   max empty n_unique
#>  1 name          feminine          0         1         3    18     0       17
#>  2 name          masculine         0         1         3    21     0       66
#>  3 name          <NA>              0         1         8    14     0        4
#>  4 hair_color    feminine          0         1         4     6     0        6
#>  5 hair_color    masculine         5         0.924     4    13     0        9
#>  6 hair_color    <NA>              0         1         4     7     0        4
#> # [...]
#> 
#> ── Variable type: list ───────────────────────────────────────────────────────────
#>   skim_variable gender    n_missing complete_rate n_unique min_length max_length
#> 1 films         feminine          0             1        9          1          5
#> 2 films         masculine         0             1       24          1          7
#> 3 films         <NA>              0             1        3          1          2
#> 4 vehicles      feminine          0             1        3          0          1
#> 5 vehicles      masculine         0             1        9          0          2
#> 6 vehicles      <NA>              0             1        1          0          0
#> # [...]
#> 
#> ── Variable type: numeric ────────────────────────────────────────────────────────
#>   skim_variable gender    n_missing complete_rate  mean     sd    p0   p25   p50
#> 1 height        feminine          1         0.941 165.   23.6     96 162.  166. 
#> 2 height        masculine         4         0.939 177.   37.6     66 171.  183  
#> 3 height        <NA>              1         0.75  181.    2.89   178 180.  183  
#> # [...]

score 4 · Answer 11 · answered Nov 28 '21 at 18:55

With more recent (>1.0) versions of dplyr you can do so with

iris %>% 
  group_by(Species)  %>% 
  summarise(as_tibble(rbind(summary(Sepal.Length))))

This works because dplyr will unpack the result of summarise into columns if the argument evaluates into a dataframe.

Ekow_ababio · Answer 12 · 2022-03-21T14:20:20.887

I would also recommend gtsummary (written by Daniel D. Sjoberg et al). You can generate publication-ready or presentation-ready tables with the package. A gtsummary solution to the example given in the question would be:

library(tidyverse)
library(gtsummary)

data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66, 
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4,6,6,8)))
df <- data.frame(group=grp, dt=data)


tbl_summary(df, 
            by=group,
            type = all_continuous() ~ "continuous2",
            statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({IQR})", "{min}- {max}"), ) %>% 
  add_stat_label(label = dt ~ c("Mean (SD)","Median (Inter Quant. Range)", "Min- Max"))

which then gives you the output below

Characteristic	A, N = 4	B, N = 6	C, N = 6	D, N = 8
dt
Mean (SD)	61.0 (1.8)	66.0 (2.8)	68.0 (1.7)	61.0 (2.6)
Meian (IQR)	61.0 (2.5)	65.5 (2.5)	68.0 (0.8)	61.5 (3.2)
Min- Max	59.0 - 63.0	63.0 - 71.0	66.0 - 71.0	56.0 - 64.0

You can also export the table as word document by doing the following:

Table1 <-  tbl_summary(df, 
                by=group,
                type = all_continuous() ~ "continuous2",
                statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({IQR})", "{min}- {max}"), ) %>% 
      add_stat_label(label = dt ~ c("Mean (SD)","Median (Inter Quant. Range)", "Min- Max"))

tmp1 <- "~path/name.docx"

Table1 %>% 
  as_flex_table() %>% 
  flextable::save_as_docx(path=tmp1)

You can use it for regression outputs as well. See the package reference manual and the package webpage for further insights

https://cran.r-project.org/web/packages/gtsummary/index.html https://www.danieldsjoberg.com/gtsummary/index.html

score 1 · Answer 13 · answered Mar 03 '21 at 08:47

1

this may also work,

spl <- split(mtcars, mtcars$cyl)
list.of.summaries <- lapply(spl, function(x) data.frame(apply(x[,3:6], 2, summary)))
list.of.summaries

answered Mar 03 '21 at 08:47

Seyma Kalay

2,037
10
22

score 1 · Answer 14 · answered Mar 23 '12 at 23:40

1

First, it depends on your version of R. If you've passed 2.11, you can use aggreggate with multiple results functions(summary, by instance, or your own function). If not, you can use the answer made by Justin.

answered Mar 23 '12 at 23:40

CPHM

19
1

Maël · Answer 15 · 2023-08-04T09:02:00.730

collapse offers a very flexible function for summary statistics with qsu:

library(collapse)
with(df, qsu(dt, g = group))

#    N  Mean      SD  Min  Max
# A  4    61  1.8257   59   63
# B  6    66  2.8284   63   71
# C  6    68  1.6733   66   71
# D  8    61  2.6186   56   64

It's also very fast:

microbenchmark::microbenchmark(
  tapply = tapply(df$dt, df$group, summary),
  dt = setDT(df)[, as.list(summary(dt)), by = group],
  collapse = qsu(df$dt, g = df$group),
  purrr = df %>% split(.$group) %>% purrr::map(summary)
)

# Unit: microseconds
#      expr    min      lq     mean  median     uq    max neval
#    tapply  453.2  503.75  531.718  522.70  548.6  946.8   100
#        dt  998.8 1076.90 1288.057 1127.55 1205.9 9569.6   100
#  collapse   14.8   24.45   38.432   36.90   43.9  121.6   100
#     purrr 2553.6 2728.85 2847.378 2816.75 2940.8 3715.8   100

How to get summary statistics by group

15 Answers15

1. `tapply`

2. `data.table`

Linked

Related

How to get summary statistics by group

15 Answers15

1. tapply

2. data.table

Linked

Related

1. `tapply`

2. `data.table`