Tidyverse: filtering n largest groups in grouped dataframe

Question

I want to filter the n largest groups based on count, and then do some calculations on the filtered dataframe

Here is some data

Brand <- c("A","B","C","A","A","B","A","A","B","C")
Category <- c(1,2,1,1,2,1,2,1,2,1)
Clicks <- c(10,11,12,13,14,15,14,13,12,11)
df <- data.frame(Brand,Category,Clicks)

|Brand | Category| Clicks|
|:-----|--------:|------:|
|A     |        1|     10|
|B     |        2|     11|
|C     |        1|     12|
|A     |        1|     13|
|A     |        2|     14|
|B     |        1|     15|
|A     |        2|     14|
|A     |        1|     13|
|B     |        2|     12|
|C     |        1|     11|

This is my expected output. I want to filter out the two largest brands by count and then find the mean clicks in each brand / category combination

|Brand | Category| mean_clicks|
|:-----|--------:|-----------:|
|A     |        1|        12.0|
|A     |        2|        14.0|
|B     |        1|        15.0|
|B     |        2|        11.5|

Which I thought could be achieved with code like this (but can't)

df %>%
  group_by(Brand, Category) %>%
  top_n(2, Brand) %>% # Largest 2 brands by count
  summarise(mean_clicks = mean(Clicks))

EDIT: the ideal answer should be able to be used on database tables as well as local tables

You might want to precise what DBMS you intend to use, not all of them support window functions and `dbplyr` might need them for some of its translations. I believe @Ronak's answer should work on a DB where window functions are supported but fail for others because of `add_count`, a fix would be to group and count in a subquery, join it to the original and proceed. — moodymudskipper, Oct 05 '18 at 10:50
actually @Paul's solution might work right away as it seems to do just that — moodymudskipper, Oct 05 '18 at 10:52

Paul · Accepted Answer · 2018-10-08T02:27:02.987

7

Another dplyr solution using a join to filter the data frame:

library(dplyr)

df %>%
  group_by(Brand) %>%
  summarise(n = n()) %>%
  top_n(2) %>% # select top 2
  left_join(df, by = "Brand") %>% # filters out top 2 Brands
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# # A tibble: 4 x 3
# # Groups:   Brand [?]
#   Brand Category mean_clicks
#   <fct>    <dbl>       <dbl>
# 1 A            1        12  
# 2 A            2        14  
# 3 B            1        15  
# 4 B            2        11.5

edited Oct 08 '18 at 02:27

answered Oct 03 '18 at 05:53

Paul

2,877
1
12
28

1

+1 because even if it's longer than some, I believe it's the only solution that will work in all DBMS supported by `dbplyr`. `arrange(desc(n)) %>% head(2)` could be replaced by using `top_n` – moodymudskipper Oct 05 '18 at 10:54

tmfmnk · Answer 2 · 2018-10-03T21:19:48.117

A different dplyr solution:

df %>%
  group_by(Brand) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  mutate(rank = dense_rank(desc(n))) %>%
  filter(rank == 1 | rank == 2) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# A tibble: 4 x 3
# Groups:   Brand [?]
  Brand Category mean_clicks
  <fct>    <dbl>       <dbl>
1 A           1.        12.0
2 A           2.        14.0
3 B           1.        15.0
4 B           2.        11.5

Or a simplified version (based on suggestions from @camille):

df %>%
  group_by(Brand) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(dense_rank(desc(n)) < 3) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

The only improvement I'd suggest on this is that since you want the first 2 ranks, instead of `rank == 1 | rank == 2`, you could make it cleaner with `rank <= 2` or `rank < 3`. That way, you can also easily skip the `mutate` line and create a rank inside the `filter` call — camille, Oct 02 '18 at 15:04

Ronak Shah · Answer 3 · 2018-09-27T08:19:43.213

2

EDIT

Based on updated question, we can add a count column first, filter only top n group counts, then group_by Brand and Category to find the mean for each group.

df %>%
  add_count(Brand, sort = TRUE) %>%
  filter(n %in% head(unique(n), 2)) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))


#   Brand Category mean_clicks
#   <fct>    <dbl>       <dbl>
#1 A            1        12  
#2 A            2        14  
#3 B            1        15  
#4 B            2        11.5

Original Answer

We can group_by Brand and do all the calculations by group and then filter top groups by top_n

library(dplyr)
df %>%
  group_by(Brand) %>%
  summarise(n = n(), 
            mean = mean(Clicks)) %>%
  top_n(2, n) %>%
  select(-n)

#  Brand  mean
#  <fct> <dbl>
#1  A      12.8
#2  B      12.7

edited Sep 27 '18 at 08:19

answered Sep 27 '18 at 07:59

Ronak Shah

377,200
20
156
213

Thank you for your answer Ronak, but I can see a situation where this will not work. I will edit my question – Shinobi_Atobe Sep 27 '18 at 08:10
using `filter(n %in% head(unique(n), 2))` is a very clever idea. But I cant help but thinking there must be a faster / more elegant solution? – Shinobi_Atobe Sep 27 '18 at 08:30
@Shinobi_Atobe I agree. Even I have a feeling that there should be something better, can't think of anything right now though. – Ronak Shah Sep 27 '18 at 08:41
If you filter based on `dense_rank` ([as @tmfmnk did](https://stackoverflow.com/a/52532885/1851712)), you can skip the sorting in `add_count` and use `df %>% add_count(Brand) %>% filter(dense_rank(desc(n)) %in% 1:2)`. Have no clue if this is 'faster / more elegant'..;) – Henrik Sep 27 '18 at 11:05
@Henrik yes.definitely but it does not reduce any of the step. Just replaces one with another. – Ronak Shah Sep 27 '18 at 12:48

Sotos · Answer 4 · 2018-09-27T08:27:13.917

A data.table idea is to get the counts grouped by Brands and filter the top two (after ordering in descending order). Then we merge with the original data frame and find the mean grouped by (Brand, Category)

library(data.table)

#Convert to data.table
dt1 <- setDT(df)

dt1[dt1[, .(cnt = .N), by = Brand][
             order(cnt, decreasing = TRUE), .SD[1:2]][,cnt := NULL], 
                   on = 'Brand'][, .(means = mean(Clicks)), by = .(Brand, Category)][]

which gives,

   Brand Category means
1:     A        1  12.0
2:     A        2  14.0
3:     B        2  11.5
4:     B        1  15.0

Shree · Answer 5 · 2018-10-05T13:17:35.157

0

How about this approach, using table, from base R -

df %>%
  filter(Brand %in% names(tail(sort(table(Brand)), 2))) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# A tibble: 4 x 3
# Groups:   Brand [?]
  Brand Category mean_clicks
  <chr>    <dbl>       <dbl>
1 A         1.00        12.0
2 A         2.00        14.0
3 B         1.00        15.0
4 B         2.00        11.5

edited Oct 05 '18 at 13:17

answered Oct 02 '18 at 22:00

Shree

10,835
1
14
36

or `names(tail(sort(table(x)),2))` to be a bit more compact, a function `is_in_top_group <- function(x,n) x %in% names(tail(sort(table(x)),n))` would make it convenient to reuse. – moodymudskipper Oct 05 '18 at 08:47
@Moody_Mudskipper Thanks! don't know how I missed that. Edited my answer. – Shree Oct 05 '18 at 13:18

score 0 · Answer 6 · edited Jun 20 '20 at 09:12

0

Slightly different than above. Just because I don't like to use join with large datasets. Some people might not like that I make and remove a small dataframe, sorry :(

df %>% count(Brand) %>% top_n(2,n) -> Top2
df %>% group_by(Brand, Category) %>% 
filter(Brand %in% Top2$Brand) %>% 
summarise(mean_clicks = mean(Clicks))
remove(Top2)

edited Jun 20 '20 at 09:12

Community

1
1

answered Oct 08 '18 at 07:19

turqoisehat

1

Tidyverse: filtering n largest groups in grouped dataframe

6 Answers6

Slightly different than above. Just because I don't like to use join with large datasets. Some people might not like that I make and remove a small dataframe, sorry :(

Linked