47

Would like to get a hand on dplyr code, but cannot figure this out. Have seen a similar issue described here for many variables (summarizing counts of a factor with dplyr and Putting rowwise counts of value occurences into new variables, how to do that in R with dplyr?), however my task is somewhat smaller.
Given a data frame, how do I count the frequency of a variable and place that in a new variable.

set.seed(9)
df <- data.frame(
    group=c(rep(1,5), rep(2,5)),
    var1=round(runif(10,1,3),0))

Then we have:

>df
   group var1
1      1    1
2      1    1
3      1    1
4      1    1
5      1    2
6      2    1
7      2    2
8      2    2
9      2    2
10     2    3

Would like a third column indicating per-group (group) how many times var1 occurs, in this example this would be: count=(4,4,4,4,1,1,3,3,3,1). I tried - without success - things like:

df %>%  group_by(group) %>% rowwise() %>% do(count = nrow(.$var1))

Explanations are very appreciated!

Community
  • 1
  • 1
user3375672
  • 3,728
  • 9
  • 41
  • 70

4 Answers4

66

All you need to do is group your data by both columns, "group" and "var1":

df %>% group_by(group, var1) %>% mutate(count = n())
#Source: local data frame [10 x 3]
#Groups: group, var1
#
#   group var1 count
#1      1    1     4
#2      1    1     4
#3      1    1     4
#4      1    1     4
#5      1    2     1
#6      2    1     1
#7      2    2     3
#8      2    2     3
#9      2    2     3
#10     2    3     1

Edit after comment

Here's an example of how you SHOULD NOT DO IT:

df %>% group_by(group, var1) %>% do(data.frame(., count = length(.$group)))

The dplyr implementation with n() is for sure much faster, cleaner and shorter and should always be preferred over such implementations as above.

talat
  • 68,970
  • 21
  • 126
  • 157
  • Nice @beginneR! So if I had more columns like ID, name whatever, I should still only group the two I am interested in ? – user3375672 Nov 06 '14 at 16:33
  • 1
    @user3375672, If you had another grouping variable in addition to the existing two ("group" and "var1"), called "ID" for example, you would group your data by those three columns and then use mutate and n() like in my answer. – talat Nov 06 '14 at 16:36
  • 1
    Still, I don't get it, why I cannot replace nrow() instead of the n() function from dplyr – user3375672 Nov 06 '14 at 16:39
  • 1
    Well, that is mainly because you are using dplyr which has a number of special functions designed to work with dplyr's syntax and `n()` is one of them. Technically, you could use the base R function `nrow()` in a `do()` statement in dplyr BUT that would make any sense because dplyr's special functions make your life so much easier. – talat Nov 06 '14 at 16:42
  • If you have **lots of data**, this command is SUPERFAST – luchonacho Oct 18 '21 at 18:53
29

Perhaps this is new functionality, but it can be done with one dplyr command:

df %>% add_count(group, var1)
   group  var1     n
 1     1     1     4
 2     1     1     4
 3     1     1     4
 4     1     1     4
 5     1     2     1
 6     2     1     1
 7     2     2     3
 8     2     2     3
 9     2     2     3
10     2     3     1
Black Adder
  • 460
  • 6
  • 12
11

We may use probably another handy function tally from dplyr

df %>% group_by(group, var1) %>% tally()
# Source: local data frame [5 x 3]
# Groups: group
# 
#   group var1 n
# 1     1    1 4
# 2     1    2 1
# 3     2    1 1
# 4     2    2 3
# 5     2    3 1
KFB
  • 3,501
  • 3
  • 15
  • 18
  • 1
    If that was the desired result (which it is not, according to the question) you could even just use `df %>% count(group, var1)`. – talat Nov 06 '14 at 17:39
  • @beginneR, Yep, depends on how we understand the required format of output. This looks more terse since it only shows the unique pairs. – KFB Nov 06 '14 at 17:43
  • 1
    I agree, after reading the Q again, that it's not 100% clear.. Anyway, it's a good option to consider. – talat Nov 06 '14 at 17:48
  • 2
    I agree to! So the answer from beginneR was what I had in mind, i.e. to add the results to the data frame structure keeping the number of rows. – user3375672 Nov 07 '14 at 11:44
  • 1
    I am using this solution, however I need to put the results back in my long-format dataframe. Can anybody tell me how do I do that? – RobertP. Jan 24 '18 at 23:56
9

Two alternatives:

1: with base R:

# option 1:
df$count <- ave(df$var1, df$var1, df$group, FUN = length)
# option 2:
df <- transform(df, count = ave(var1, var1, group, FUN = length))

which gives:

> df
   group var1 count
1      1    1     4
2      1    1     4
3      1    1     4
4      1    1     4
5      1    2     1
6      2    1     1
7      2    2     3
8      2    2     3
9      2    2     3
10     2    3     1

2: with :

library(data.table)
setDT(df)[, count := .N, by = .(group, var1)]

which gives the same result:

> df
    group var1 count
 1:     1    1     4
 2:     1    1     4
 3:     1    1     4
 4:     1    1     4
 5:     1    2     1
 6:     2    1     1
 7:     2    2     3
 8:     2    2     3
 9:     2    2     3
10:     2    3     1

If you want to summarise, you can use:

# with base R:
aggregate(id ~ group + var1, transform(df, id = 1), length)

# with 'dplyr':
count(df, group, var1)

# with 'data.table':
setDT(df)[, .N, by = .(group, var1)]
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Can't understand why `df$count <- ave(df$var1, df$group, FUN = length)` does not work. Why need to specify `var1` twice? – luchonacho Oct 18 '21 at 16:24
  • If you have **lots of data**, this command is SUPER SLOW. Prefer the others. – luchonacho Oct 18 '21 at 18:53
  • RE 1st comment: The first `var1` is the input for the lengt-function, the 2nd `var1` is used as a grouping variable (together with `group`). RE 2nd comment: If you have lots of data, use the [tag:data.table] option; its the fastest available in the R ecosystem. – Jaap Oct 18 '21 at 20:26
  • 1
    @luchonacho the data.table option is the fastest one. – skan Mar 18 '22 at 20:12