Group by a factor and then summarise a different variable

Question

I have data in this format: each sample belongs to a group (A or B in this example) and has a numerical quantity and a quality score (which is a factor).

I would like to count how often each qual_score occurs within each group_name.

Example Data:

group_name <- rep(c("A","B"),5)
qual_score <- c(rep("POOR",4),rep("FAIR",1),rep("GOOD",5))
quantity <- 5:14

df <- data.frame(group_name, qual_score, quantity)

> df
   group_name qual_score quantity
1           A       POOR        5
2           B       POOR        6
3           A       POOR        7
4           B       POOR        8
5           A       FAIR        9
6           B       GOOD       10
7           A       GOOD       11
8           B       GOOD       12
9           A       GOOD       13
10          B       GOOD       14

Desired Output:

desired_output <- data.frame(c("2","2"),c("1","0"),c("2","3"))
colnames(desired_output) <- c("POOR", "FAIR", "GOOD")
rownames(desired_output) <- c("A", "B")
desired_output

  POOR FAIR GOOD
A    2    1    2
B    2    0    3

I can do summary() of qual_score for the entire dataframe:

> summary(df$qual_score)
FAIR GOOD POOR 
   1    5    4

And can group_by() to summarise mean(quantity) according to each group:

> df %>%
+     group_by(group_name) %>%
+     summarise(mean(quantity))
# A tibble: 2 x 2
  group_name `mean(quantity)`
  <fct>                 <dbl>
1 A                         9
2 B                        10

But when I try to use group_by() with summary() I get a warning and the following output:

> df %>%
+     group_by(group_name) %>%
+     summary(qual_score)
 group_name qual_score    quantity    
 A:5        FAIR:1     Min.   : 5.00  
 B:5        GOOD:5     1st Qu.: 7.25  
            POOR:4     Median : 9.50  
                       Mean   : 9.50  
                       3rd Qu.:11.75  
                       Max.   :14.00  
Warning messages:
1: In if (length(ll) > maxsum) { :
  the condition has length > 1 and only the first element will be used
2: In if (length(ll) > maxsum) { :
  the condition has length > 1 and only the first element will be used

Without `dplyr`: `table(df[, c('group_name', 'qual_score')])` — Stewart Macdonald, Jul 01 '19 at 00:16
I think what you need is `df %>% count(group_name, qual_score) %>% spread(qual_score, n, fill = 0)` — Ronak Shah, Jul 01 '19 at 00:22

M-- · Accepted Answer · 2019-07-01T00:44:20.433

library(dplyr)

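# Drop the numeric column, then cross-tabulate the two remaining columns.
# table() counts every group_name/qual_score pair; it does not use the grouping itself.
df %>% 
  group_by(group_name) %>% 
  select(-quantity) %>% 
  table()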

#>           qual_score
#> group_name FAIR GOOD POOR
#>          A    1    2    2
#>          B    0    3    2
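
The columns above come out alphabetically (FAIR, GOOD, POOR) rather than in the order shown in the question. A minimal base R sketch that reproduces the desired layout, assuming POOR, FAIR, GOOD is the intended level order and that a plain data frame with the groups as row names is acceptable (the names `counts` and `wide` are only illustrative):

```r
# Rebuild the example data from the question
group_name <- rep(c("A", "B"), 5)
qual_score <- c(rep("POOR", 4), rep("FAIR", 1), rep("GOOD", 5))
df <- data.frame(group_name, qual_score, quantity = 5:14)

# Assumption: POOR < FAIR < GOOD is the column order you want.
# Setting the levels explicitly also works when qual_score is a character column.
df$qual_score <- factor(df$qual_score, levels = c("POOR", "FAIR", "GOOD"))

# Cross-tabulate group against score; columns follow the factor levels
counts <- table(df$group_name, df$qual_score)

# Convert the 2-D table to a data frame with groups as row names
wide <- as.data.frame.matrix(counts)
wide
#>   POOR FAIR GOOD
#> A    2    1    2
#> B    2    0    3
```

As for the attempt in the question: `summary()` is not a dplyr verb, so it ignores the grouping, and in `summary(., qual_score)` the second argument is matched positionally to `maxsum`, which is most likely where the "condition has length > 1" warning comes from.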

If you want a solution entirely within the tidyverse:

library(dplyr)
library(tidyr)

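# Count rows per group_name/qual_score pair, then spread the scores into columns.
# fill = 0 supplies a zero for combinations that never occur (here: B / FAIR).
df %>% 
  group_by(group_name, qual_score) %>%
  tally() %>%
  spread(qual_score, n, fill = 0)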

#> # A tibble: 2 x 4
#> # Groups:   group_name [2]
#>   group_name  FAIR  GOOD  POOR
#>   <fct>      <dbl> <dbl> <dbl>
#> 1 A              1     2     2
#> 2 B              0     3     2
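
In newer versions of tidyr, `spread()` is superseded by `pivot_wider()`. A sketch of the same idea, assuming tidyr 1.0.0 or later is installed; `count()` is shorthand for the `group_by()` plus `tally()` pair above, and `wide` is just an illustrative name:

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(
  group_name = rep(c("A", "B"), 5),
  qual_score = c(rep("POOR", 4), "FAIR", rep("GOOD", 5)),
  quantity   = 5:14
)

wide <- df %>%
  # One row per group_name/qual_score pair, with the count in column n
  count(group_name, qual_score) %>%
  # One column per score; missing combinations are filled with 0
  pivot_wider(
    names_from  = qual_score,
    values_from = n,
    values_fill = list(n = 0L)
  )

wide
```

This should give one row per group with FAIR, GOOD and POOR count columns, and the result is ungrouped, so no `ungroup()` is needed afterwards.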
