118

When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result?

Here's an example with fake data.

library(dplyr)

df = data.frame(a=rep(1:3,4), b=rep(1:2,6))

# Now add an extra level to df$b that has no corresponding value in df$a
df$b = factor(df$b, levels=1:3)

# Summarise with plyr, keeping categories with a count of zero
plyr::ddply(df, "b", summarise, count_a=length(a), .drop=FALSE)

  b    count_a
1 1    6
2 2    6
3 3    0

# Now try it with dplyr
df %.%
  group_by(b) %.%
  summarise(count_a=length(a), .drop=FALSE)

  b     count_a .drop
1 1     6       FALSE
2 2     6       FALSE

Not exactly what I was hoping for. Is there a dplyr method for achieving the same result as .drop=FALSE in plyr?

eipi10
  • 91,525
  • 24
  • 209
  • 285

4 Answers4

74

The issue is still open, but in the meantime, especially since your data are already factored, you can use complete from "tidyr" to get what you might be looking for:

library(tidyr)
df %>%
  group_by(b) %>%
  summarise(count_a=length(a)) %>%
  complete(b)
# Source: local data frame [3 x 2]
# 
#        b count_a
#   (fctr)   (int)
# 1      1       6
# 2      2       6
# 3      3      NA

If you wanted the replacement value to be zero, you need to specify that with fill:

df %>%
  group_by(b) %>%
  summarise(count_a=length(a)) %>%
  complete(b, fill = list(count_a = 0))
# Source: local data frame [3 x 2]
# 
#        b count_a
#   (fctr)   (dbl)
# 1      1       6
# 2      2       6
# 3      3       0
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • 15
    It took me a lot of head banging against the wall to figure this out so I will mention it here... If you group by 2 variables, and they are characters rather than factors, you will need to use `ungroup()` before you complete. If you ever notice `complete` not actually completing, `ungroup` is probably needed. – williamsurles Jun 13 '17 at 16:21
  • What If you have even more grouping variables? I get a huge number of rows (much more than my original dataframe) if I use all the grouping vars from my group_by – TobiO Mar 09 '18 at 15:46
  • 2
    I figured it out: You have to use nesting :-) So put all the Variables that should not also be combined among themselves in `complete(variablewithdroppedlevels, nesting(var1,var2,var3))` (it's actually in the help for `complete` still took me a while to figure out – TobiO Mar 09 '18 at 16:09
46

Since dplyr 0.8 group_by gained the .drop argument that does just what you asked for:

df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
df$b = factor(df$b, levels=1:3)

df %>%
  group_by(b, .drop=FALSE) %>%
  summarise(count_a=length(a))

#> # A tibble: 3 x 2
#>   b     count_a
#>   <fct>   <int>
#> 1 1           6
#> 2 2           6
#> 3 3           0

One additional note to go with @Moody_Mudskipper's answer: Using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See examples below:

library(dplyr)
data(iris)

# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))

# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally

#>   Species         n
#> 1 setosa         50
#> 2 versicolor     50
#> 3 virginica      50
#> 4 empty_level     0

# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))

# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally

#>   Species     group2     n
#> 1 setosa      A         25
#> 2 setosa      B         25
#> 3 versicolor  A         25
#> 4 versicolor  B         25
#> 5 virginica   B         25
#> 6 virginica   C         25
#> 7 empty_level <NA>       0

# Turn group2 into a factor
iris$group2 = factor(iris$group2)

# Now all possible combinations of Species and group2 are included in the output, 
#  whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally

#>    Species     group2     n
#>  1 setosa      A         25
#>  2 setosa      B         25
#>  3 setosa      C          0
#>  4 versicolor  A         25
#>  5 versicolor  B         25
#>  6 versicolor  C          0
#>  7 virginica   A          0
#>  8 virginica   B         25
#>  9 virginica   C         25
#> 10 empty_level A          0
#> 11 empty_level B          0
#> 12 empty_level C          0

Created on 2019-03-13 by the reprex package (v0.2.1)
eipi10
  • 91,525
  • 24
  • 209
  • 285
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • I've added an additional note to your answer. Please feel free to delete if you don't like the edit. – eipi10 Mar 13 '19 at 20:52
  • I've [filed an issue about this](https://github.com/tidyverse/dplyr/issues/4282) on github to find out whether this is a bug or the intended behavior. – eipi10 Mar 13 '19 at 21:18
  • @eipi10 slightly shorter is the use of `count`: `iris %>% count(Species, group2, .drop=FALSE)` – tjebo Oct 10 '19 at 15:46
  • @tjebo, I'm pretty sure this does not work (as of dplyr 1.0.7). I don't get the zero-count groups. – rocarvaj Nov 11 '21 at 16:45
  • @rocarvaj at least on my dplyr 1.0.7 it still works. – tjebo Nov 11 '21 at 21:04
  • for me as well @rocarvaj, can you confirm that you can't reproduce the first output and that `packageVersion("dplyr")`returns `‘1.0.7’`? And make sure you don't skip the `df$b = factor(df$b, levels=1:3)`line – moodymudskipper Nov 12 '21 at 12:24
  • 1
    @tjebo and Moody_Mudskipper, my bad. I forgot to convert the columns to factors. I'll punish myself for my poor reading comprehension. Please ignore my previous comment. Thanks! :) – rocarvaj Nov 12 '21 at 12:47
  • By the way, it also doesn't work when counting over a `datetime`, you *must* convert to factor for this to work – Matias Andina Dec 16 '21 at 16:38
22

dplyr solution:

First make grouped df

by_b <- tbl_df(df) %>% group_by(b)

then we summarise those levels that occur by counting with n()

res <- by_b %>% summarise( count_a = n() )

then we merge our results into a data frame that contains all factor levels:

expanded_res <- left_join(expand.grid(b = levels(df$b)),res)

finally, in this case since we are looking at counts the NA values are changed to 0.

final_counts <- expanded_res[is.na(expanded_res)] <- 0

This can also be implemented functionally, see answers: Add rows to grouped data with dplyr?

A hack:

I thought I would post a terrible hack that works in this case for interest's sake. I seriously doubt you should ever actually do this but it shows how group_by() generates the atrributes as if df$b was a character vector not a factor with levels. Also, I don't pretend to understand this properly -- but I am hoping this helps me learn -- this is the only reason I'm posting it!

by_b <- tbl_df(df) %>% group_by(b)

define an "out-of-bounds" value that cannot exist in dataset.

oob_val <- nrow(by_b)+1

modify attributes to "trick" summarise():

attr(by_b, "indices")[[3]] <- rep(NA,oob_val)
attr(by_b, "group_sizes")[3] <- 0
attr(by_b, "labels")[3,] <- 3

do the summary:

res <- by_b %>% summarise(count_a = n())

index and replace all occurences of oob_val

res[res == oob_val] <- 0

which gives the intended:

> res
Source: local data frame [3 x 2]

b count_a
1 1       6
2 2       6
3 3       0
Community
  • 1
  • 1
npjc
  • 4,134
  • 1
  • 22
  • 34
13

this is not exactly what was asked in the question, but at least for this simple example, you could get the same result using xtabs, for example:

using dplyr:

df %>%
  xtabs(formula = ~ b) %>%
  as.data.frame()

or shorter:

as.data.frame(xtabs( ~ b, df))

result (equal in both cases):

  b Freq
1 1    6
2 2    6
3 3    0
tjebo
  • 21,977
  • 7
  • 58
  • 94
talat
  • 68,970
  • 21
  • 126
  • 157