summarizing counts of a factor with dplyr

Question

I want to group a data frame by a column (owner) and output a new data frame that has counts of each type of a factor at each observation. The real data frame is fairly large, and there are 10 different factors.

Here is some example input:

library(dplyr)
df = tbl_df(data.frame(owner=c(0,0,1,1), obs1=c("quiet", "loud", "quiet", "loud"), obs2=c("loud", "loud", "quiet", "quiet")))

  owner  obs1  obs2
1     0 quiet  loud
2     0  loud  loud
3     1 quiet quiet
4     1  loud quiet

I was looking for output that looks like this:

out = data.frame(owner=c("0", "0", "1", "1"), observation=c("obs1", "obs2", "obs1", "obs2"), quiet=c(1, 0, 1, 2), loud=c(1, 2, 1, 0))

  owner observation quiet loud
1     0        obs1     1    1
2     0        obs2     0    2
3     1        obs1     1    1
4     1        obs2     2    0

Melting gets me partway there:

melted = tbl_df(melt(df, id=c("owner")))

  owner variable value
1     0     obs1 quiet
2     0     obs1  loud
3     1     obs1 quiet
4     1     obs1  loud
5     0     obs2  loud
6     0     obs2  loud
7     1     obs2 quiet
8     1     obs2 quiet

But what's the last step? If 'value' was a numeric, I'd just go:

melted %>% group_by(owner, variable) %>% summarise(counts=sum(value))

Thanks so much!

This is an old question, but for what it's worth there's a little-known feature of `dcast` that allows you to apply an aggregating/summary function in these cases. I think it defaults to counting. — shadowtalker, Jun 13 '15 at 23:16

score 32 · Accepted Answer · edited Dec 25 '19 at 16:18

32

In 2017 the answer is

library(dplyr)
library(tidyr)

gather(df, key, value, -owner) %>%
  group_by(owner, key, value) %>%
  tally %>% 
  spread(value, n, fill = 0)

Which gives output

Source: local data frame [4 x 4]
Groups: owner, key [4]

  owner   key  loud quiet
* <dbl> <chr> <dbl> <dbl>
1     0  obs1     1     1
2     0  obs2     2     0
3     1  obs1     1     1
4     1  obs2     0     2

In 2019 the answer is:

gather(df, key, value, -owner) %>% 
    count(owner, key, value) %>% 
    spread(value, n, fill = 0)

edited Dec 25 '19 at 16:18

baxx

3,956
6
37
75

answered Jan 19 '17 at 07:05

Sam Clifford

436
4
4

This is the better answer these days. – Monduiz Mar 04 '17 at 19:35
1

In 2019 the answer is ```` gather(df, key, value, -owner) %>% count(owner, key, value) %>% spread(value, n, fill = 0) ```` – Sam Clifford Apr 01 '19 at 14:17
5

in 2019 is pivot_longer / wider recommended over gather / spread? – baxx Dec 25 '19 at 12:07
@baxx [It seems so](https://tidyr.tidyverse.org/reference/gather.html) (I had to look up gather to remind myself what it is). – Hendy Jan 31 '22 at 21:55

akrun · Answer 2 · 2014-11-03T14:02:21.573

29

You could use tidyr with dplyr

library(dplyr)
library(tidyr)

 df %>%
 gather(observation, Val, obs1:obs2) %>% 
 group_by(owner,observation, Val) %>% 
 summarise(n= n()) %>%
 ungroup() %>%
 spread(Val, n, fill=0)

which gives the output

  #    owner observation loud quiet
  #1     0        obs1    1     1
  #2     0        obs2    2     0
  #3     1        obs1    1     1
  #4     1        obs2    0     2

edited Nov 03 '14 at 14:02

answered Sep 12 '14 at 15:45

akrun

874,273
37
540
662

1

`df %>% gather(observation, Val, obs1:obs2) %>% group_by(owner, variable, value) %>% summarise(n= n()) %>% spread(value, n, fill=0)` – Rory Kirchner Sep 12 '14 at 16:12
@Rory Kirchner The column names should be consistent. Here, in the `gather(...)` you created a variable `Val`, but in the `group_by(...)` and later on that variable was discarded and in its places `value` was used. – akrun Sep 12 '14 at 16:19
Hm-- Val -> value for me: df %>% gather(observation, Val, obs1:obs2) -> owner variable value as the column names – Rory Kirchner Sep 12 '14 at 16:36
2

I'm getting "index out of bounds" error with your `spread` approach. – Paulo E. Cardoso Nov 03 '14 at 13:54
1

@Paulo Cardoso It did work with the version I had before. Now, it seems like you have to do `ungroup()` before `spread` as Val` is one of the `grouping` variable. – akrun Nov 03 '14 at 14:02
True. Thank you for updating your answer. – Paulo E. Cardoso Nov 03 '14 at 16:28

score 3 · Answer 3 · answered Dec 12 '15 at 01:08

If you wanted to forego the dplyr, you can split into lists.

df <- split(df, list(df[[obs1]], df[[obs2]])

If you wanted the count, you just create an sapply or lapply call to run through the lists and get the count of each one. Or literally any other function you want.

summarizing counts of a factor with dplyr

3 Answers3

Linked