1

I am looking to summarize a customer transactional data frame to a single row per customer using dplyr. For continuous variables this is simple: use sum, mean, etc. For categorical variables I would like to choose the "Mode", i.e. the most commonly encountered value within the group, and to do this across multiple columns.

For example, take the table Cus:

Cus <- data.frame(Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
                  Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
                  Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran"))

And generate the table Cus_Summary:

Cus_Summary <- data.frame(Customer = c("C-01", "C-02", "C-03"),
                          Product = c("COKE", "BURGER", "CHICKEN"),
                          Store = c("NYC", "Chicago", "LA"))

Are there any packages that provide this function? Or does anyone have a function that can be applied across multiple columns within a dplyr step?

I am not worried about smart ways to handle ties - any output for a tie will suffice (although any suggestions as to how to best handle ties would be interesting and appreciated).

Brisbane Pom

4 Answers

2

How about this?

Cus %>%
    group_by(Customer) %>%
    summarise(
        Product = first(names(sort(table(Product), decreasing = TRUE))),
        Store = first(names(sort(table(Store), decreasing = TRUE))))
# A tibble: 3 x 3
#  Customer Product Store
#  <fct>    <chr>   <chr>
#1 C-01     COKE    NYC
#2 C-02     BURGER  Chicago
#3 C-03     CHICKEN LA

Note that in the case of ties this selects the first entry in alphabetical order.


Update

To randomly select an entry from the tied top-frequency entries, we can define a custom function:

top_random <- function(x) {
    tbl <- sort(table(x), decreasing = TRUE)   # frequency table, most frequent first
    top <- tbl[tbl == max(tbl)]                # all entries tied for the top count
    return(sample(names(top), 1))              # pick one of the tied entries at random
}

Then the following randomly selects one of the tied top entries:

Cus %>%
    group_by(Customer) %>%
    summarise(
        Product = top_random(Product),
        Store = top_random(Store))
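
If a group can be entirely missing (an issue raised in the comments below), a hedged NA-safe variant along the lines suggested there might look like this (`top_random_safe` is an illustrative name; returning `NA_character_` rather than `""` is an assumption):

top_random_safe <- function(x) {
    x <- x[!is.na(x)]                          # drop missing values first
    if (length(x) == 0) return(NA_character_)  # all-NA group: return NA (or "" if preferred)
    tbl <- sort(table(x), decreasing = TRUE)
    top <- tbl[tbl == max(tbl)]                # all entries tied for the top count
    sample(names(top), 1)
}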
Maurits Evers
  • Your solution does not consider the case where more than one value is most frequent. – Paweł Chabros Jan 22 '19 at 09:27
  • @PawełChabros Yes it does; in the case of ties it chooses the first entry in alphabetical order. Additionally, see RonakShah's and the OP's comments regarding ties. Your downvotes are quite unwarranted. – Maurits Evers Jan 22 '19 at 09:32
  • Ok. My bad. I will reverse my downvote if you make an edit :) – Paweł Chabros Jan 22 '19 at 09:51
  • Should be possible here as well. – Ronak Shah Jan 22 '19 at 09:51
  • @PawełChabros your turn;-) – Maurits Evers Jan 22 '19 at 09:57
  • I like this - it's succinct and easily readable. A couple of questions: 1 - are there any memory issues if this is applied to a large data frame? Does generating the table within the group, when applied thousands of times, pose any problems? 2 - now that I think about the ties issue some more, is it possible to select a result at random rather than the first alphabetically, to avoid creating a bias in the resultant output? :-) – Brisbane Pom Jan 22 '19 at 11:30
  • @BrisbanePom Concerning 1: I find this difficult to answer; in the end a benchmark analysis will tell. However, `tidyverse` solutions are not usually geared towards speed. If speed/memory issues are your concern, a `data.table` solution might be better, but again this will depend on your specific data. Concerning 2: Interesting suggestion. I've made an edit to show how to randomly select entries from ties. – Maurits Evers Jan 22 '19 at 11:45 (see the data.table sketch after these comments)
  • As is typical when applying code to a real-world application, a problem has arisen - I am getting errors (probably due to missing values causing the function to crash). Any chance you could update that top_random function with a Try / Catch construct, Maurits, as my cack-handed attempt failed! – Brisbane Pom Feb 04 '19 at 05:56
  • @BrisbanePom More than happy to take a look at it. Would you be able to post some minimal sample data that reproduces the errors you're getting? – Maurits Evers Feb 04 '19 at 06:20
  • Just to improve it a little bit, add `if(all(is.na(x))){ return("") }` in case your column only has NA (all values are missing). – Omar Benites Aug 18 '22 at 03:44
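
For reference, a minimal data.table sketch of the route mentioned in the comment above on speed, assuming the same table-based mode (the helper name `most_frequent` is illustrative):

library(data.table)

# Illustrative helper: most frequent value, first alphabetically on ties
most_frequent <- function(x) names(sort(table(x), decreasing = TRUE))[1]

# Apply the helper to every non-grouping column, by Customer
as.data.table(Cus)[, lapply(.SD, most_frequent), by = Customer]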
0

If you have many columns and want to find the most frequent value in each of them, you could use gather to convert the data to long format, count the occurrences for each column, group_by Customer and column, keep only the rows with the maximum count, and then spread it back to wide format.

library(tidyverse)

Cus %>%
  gather(key, value, -Customer) %>%
  count(Customer, key, value) %>%
  group_by(Customer, key) %>%
  slice(which.max(n)) %>%
  ungroup() %>%
  spread(key, value) %>%
  select(-n)

# Customer Product Store  
#  <fct>    <chr>   <chr>  
#1 C-01     COKE    NYC    
#2 C-02     BURGER  Chicago
#3 C-03     CHICKEN LA   

EDIT

In case of ties, if we want to select among them randomly, we can filter all the max values and then use the sample_n function to select random rows.

Cus %>%
  gather(key, value, -Customer) %>%
  count(Customer, key, value) %>%
  group_by(Customer, key) %>%
  filter(n == max(n)) %>%
  sample_n(1) %>%
  ungroup() %>%
  spread(key, value) %>%
  select(-n)


# Customer Product Store   
#  <fct>    <chr>   <chr>   
#1 C-01     COKE    NYC     
#2 C-02     BURGER  Chicago 
#3 C-03     FISH    San Fran
Ronak Shah
  • Your solution does not consider the case where more than one value is most frequent. – Paweł Chabros Jan 22 '19 at 09:27
  • @PawełChabros OP mentioned `I am not worried about smart ways to handle ties - any output for a tie will suffice`. IMO, taking the first max value would be included in "any output for a tie". – Ronak Shah Jan 22 '19 at 09:29
  • @PawełChabros you should be able to do that now. – Ronak Shah Jan 22 '19 at 09:50
  • Yes, this works well for my original question, where ties are not an issue. Now that I've thought about it, would there be a way to apply the slice for max(n) such that the resulting slice value is selected at random? Would be interested to compare an approach under this method with the two approaches given above. – Brisbane Pom Jan 22 '19 at 11:40
  • @BrisbanePom We can `filter` all the `max` values and then use `sample_n` to select random rows. I have updated the answer. – Ronak Shah Jan 22 '19 at 14:22
0

In my solution, if there is more than one most frequent value, all are presented:

library(tidyverse)

Cus %>%
  gather('type', 'value', -Customer) %>%
  group_by(Customer, type, value) %>%
  count() %>%
  group_by(Customer, type) %>%
  filter(n == max(n)) %>%
  group_by(Customer) %>%
  nest() %>%
  mutate(
    Product = map_chr(data, ~str_c(filter(.x, type == 'Product') %>% pull(value), collapse = ', ')),
    Store = map_chr(data, ~str_c(filter(.x, type == 'Store') %>% pull(value), collapse = ', '))
  ) %>%
  select(-data)

Result is:

# A tibble: 3 x 3
  Customer Product       Store           
  <fct>    <chr>         <chr>           
1 C-01     COKE          NYC             
2 C-02     BURGER        Chicago, Detroit
3 C-03     CHICKEN, FISH LA, San Fran  
Paweł Chabros
  • If we could choose one of the tied variants at random, rather than joining the strings with a comma, this solution may be the best way to pick a random value from the tie and avoid the bias that the alphabetical-first method introduces (see comment in the answer given above). – Brisbane Pom Jan 22 '19 at 11:35 (a sketch of that variation follows this comment)
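
A rough sketch of that variation, reusing the gather/count steps from this answer but sampling one tied value per column instead of collapsing them (an illustrative adaptation):

Cus %>%
  gather('type', 'value', -Customer) %>%
  count(Customer, type, value) %>%
  group_by(Customer, type) %>%
  filter(n == max(n)) %>%                   # keep all tied top values per column
  summarise(value = sample(value, 1)) %>%   # pick one of the ties at random
  spread(type, value)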
0

Using SO's favourite Mode function (though you could use any):

Mode <- function(x) {
  ux <- unique(x)                          # candidate values
  ux[which.max(tabulate(match(x, ux)))]    # value with the highest count (first encountered wins ties)
}

In base R

aggregate(. ~ Customer, lapply(Cus, as.character), Mode)
#   Customer Product   Store
# 1     C-01    COKE     NYC
# 2     C-02  BURGER Chicago
# 3     C-03 CHICKEN      LA

using dplyr

library(dplyr)
Cus %>%
  group_by(Customer) %>%
  summarise_all(Mode)

# # A tibble: 3 x 3
# Customer Product   Store
# <fctr>  <fctr>  <fctr>
# 1     C-01    COKE     NYC
# 2     C-02  BURGER Chicago
# 3     C-03 CHICKEN      LA
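
With dplyr 1.0 or later, where summarise_all() is superseded, an equivalent call using across() would be (a sketch, assuming the same Mode function):

Cus %>%
  group_by(Customer) %>%
  summarise(across(everything(), Mode))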
moodymudskipper