22

I want to group a data frame by a column (owner) and output a new data frame that has counts of each type of a factor at each observation. The real data frame is fairly large, and there are 10 different factors.

Here is some example input:

library(dplyr)
df = tbl_df(data.frame(owner=c(0,0,1,1), obs1=c("quiet", "loud", "quiet", "loud"), obs2=c("loud", "loud", "quiet", "quiet")))

  owner  obs1  obs2
1     0 quiet  loud
2     0  loud  loud
3     1 quiet quiet
4     1  loud quiet

I was looking for output that looks like this:

out = data.frame(owner=c("0", "0", "1", "1"), observation=c("obs1", "obs2", "obs1", "obs2"), quiet=c(1, 0, 1, 2), loud=c(1, 2, 1, 0))

  owner observation quiet loud
1     0        obs1     1    1
2     0        obs2     0    2
3     1        obs1     1    1
4     1        obs2     2    0

Melting gets me partway there:

melted = tbl_df(melt(df, id=c("owner")))

  owner variable value
1     0     obs1 quiet
2     0     obs1  loud
3     1     obs1 quiet
4     1     obs1  loud
5     0     obs2  loud
6     0     obs2  loud
7     1     obs2 quiet
8     1     obs2 quiet

But what's the last step? If 'value' was a numeric, I'd just go:

melted %>% group_by(owner, variable) %>% summarise(counts=sum(value))

Thanks so much!

Rory Kirchner
  • 323
  • 1
  • 3
  • 5
  • This is an old question, but for what it's worth there's a little-known feature of `dcast` that allows you to apply an aggregating/summary function in these cases. I think it defaults to counting. – shadowtalker Jun 13 '15 at 23:16

3 Answers3

32

In 2017 the answer is

library(dplyr)
library(tidyr)

gather(df, key, value, -owner) %>%
  group_by(owner, key, value) %>%
  tally %>% 
  spread(value, n, fill = 0)

Which gives output

Source: local data frame [4 x 4]
Groups: owner, key [4]

  owner   key  loud quiet
* <dbl> <chr> <dbl> <dbl>
1     0  obs1     1     1
2     0  obs2     2     0
3     1  obs1     1     1
4     1  obs2     0     2

In 2019 the answer is:

gather(df, key, value, -owner) %>% 
    count(owner, key, value) %>% 
    spread(value, n, fill = 0)
baxx
  • 3,956
  • 6
  • 37
  • 75
Sam Clifford
  • 436
  • 4
  • 4
29

You could use tidyr with dplyr

library(dplyr)
library(tidyr)

 df %>%
 gather(observation, Val, obs1:obs2) %>% 
 group_by(owner,observation, Val) %>% 
 summarise(n= n()) %>%
 ungroup() %>%
 spread(Val, n, fill=0)

which gives the output

  #    owner observation loud quiet
  #1     0        obs1    1     1
  #2     0        obs2    2     0
  #3     1        obs1    1     1
  #4     1        obs2    0     2
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    `df %>% gather(observation, Val, obs1:obs2) %>% group_by(owner, variable, value) %>% summarise(n= n()) %>% spread(value, n, fill=0)` – Rory Kirchner Sep 12 '14 at 16:12
  • @Rory Kirchner The column names should be consistent. Here, in the `gather(...)` you created a variable `Val`, but in the `group_by(...)` and later on that variable was discarded and in its places `value` was used. – akrun Sep 12 '14 at 16:19
  • Hm-- Val -> value for me: df %>% gather(observation, Val, obs1:obs2) -> owner variable value as the column names – Rory Kirchner Sep 12 '14 at 16:36
  • 2
    I'm getting "index out of bounds" error with your `spread` approach. – Paulo E. Cardoso Nov 03 '14 at 13:54
  • 1
    @Paulo Cardoso It did work with the version I had before. Now, it seems like you have to do `ungroup()` before `spread` as Val` is one of the `grouping` variable. – akrun Nov 03 '14 at 14:02
  • True. Thank you for updating your answer. – Paulo E. Cardoso Nov 03 '14 at 16:28
3

If you wanted to forego the dplyr, you can split into lists.

df <- split(df, list(df[[obs1]], df[[obs2]])

If you wanted the count, you just create an sapply or lapply call to run through the lists and get the count of each one. Or literally any other function you want.

black_sheep07
  • 2,308
  • 3
  • 26
  • 40