86

I wish to count the number of unique values by grouping of a second variable, and then add the count to the existing data.frame as a new column. For example, if the existing data frame looks like this:

  color  type
1 black chair
2 black chair
3 black  sofa
4 green  sofa
5 green  sofa
6   red  sofa
7   red plate
8  blue  sofa
9  blue plate
10 blue chair

I want to add for each color, the count of unique types that are present in the data:

  color  type unique_types
1 black chair            2
2 black chair            2
3 black  sofa            2
4 green  sofa            1
5 green  sofa            1
6   red  sofa            2
7   red plate            2
8  blue  sofa            3
9  blue plate            3
10 blue chair            3

I was hoping to use ave, but can't seem to find a straightforward method that doesn't require many lines. I have >100,000 rows, so am also not sure how important efficiency is.

It's somewhat similar to this issue: Count number of observations/rows per group and add result to data frame

Henrik
  • 65,555
  • 14
  • 143
  • 159
Bryan
  • 1,771
  • 4
  • 17
  • 30

3 Answers3

87

Here's a solution with the dplyr package - it has n_distinct() as a wrapper for length(unique()).

df %>%
  group_by(color) %>%
  mutate(unique_types = n_distinct(type))
Sam Firke
  • 21,571
  • 9
  • 87
  • 105
  • 1
    Thanks for the dplyr solution Sam. Let's say I wanted to go one step further and "subset" df to include only "color" and the "corresponding number of distinct types". I remember there is a slick function for these but I can't remember, advice? – Prince M Mar 31 '21 at 18:19
80

Using ave (since you ask for it specifically):

within(df, { count <- ave(type, color, FUN=function(x) length(unique(x)))})

Make sure that type is character vector and not factor.


Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table solution as well.

require(data.table)
setDT(df)[, count := uniqueN(type), by = color] # v1.9.6+
# if you don't want df to be modified by reference
ans = as.data.table(df)[, count := uniqueN(type), by = color]

uniqueN was implemented in v1.9.6 and is a faster equivalent of length(unique(.)). In addition it also works with data.frames/data.tables.


Other solutions:

Using plyr:

require(plyr)
ddply(df, .(color), mutate, count = length(unique(type)))

Using aggregate:

agg <- aggregate(data=df, type ~ color, function(x) length(unique(x)))
merge(df, agg, by="color", all=TRUE)
Arun
  • 116,683
  • 26
  • 284
  • 387
  • 1
    using `length(unique(type))` in the `plyr` version is only equivalent to `data.table::uniqueN(type)` if you have no NA values. `uniqueN` will report 0 for empty groups whereas `length(unique(type))` will report 1! The `dplyr` (not sure about `plyr`) equivalen of `data.table::uniqueN(x, na.rm = TRUE)` is `dplyr::n_distinct(x, na.rm = TRUE)` – Daniel Oct 30 '20 at 19:59
9

This can be also achieved in a vectorized without by group operations by combining unique with table or tabulate

If df$color is factor, then

Either

table(unique(df)$color)[as.character(df$color)]
# black black black green green   red   red  blue  blue  blue 
#    2     2     2     1     1     2     2     3     3     3 

Or

tabulate(unique(df)$color)[as.integer(df$color)]
# [1] 2 2 2 1 1 2 2 3 3 3

If df$color is character then just

table(unique(df)$color)[df$color]

If df$color is an integer then just

tabulate(unique(df)$color)[df$color]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196