Get most frequently occurring factor level in dplyr piping structure

Question

I'd like to find the most frequently occurring level of a factor in a dataset while using dplyr's piping structure. I'm trying to create a new variable that contains the 'modal' factor level when grouped by another variable.

This is an example of what I'm looking for:

library(dplyr)  # provides %>%, group_by() and mutate()

df <- data.frame(cat = stringi::stri_rand_strings(100, 1, '[A-Z]'), num = floor(runif(100, min=0, max=500)))
df <- df %>%
    dplyr::group_by(cat) %>%
    dplyr::mutate(cat_mode = Mode(num))

where "Mode" is the function I'm looking for.

Mode-determination is hard and problem-ridden, exacerbated when the data is not cleanly unimodal. Are you looking for the mathematical mode (with smoothing) or the most-frequent occurrence? — r2evans, Jul 18 '18 at 23:04
I'm looking for the most frequent occurrence of a categorical variable with few levels (7). — Parseltongue, Jul 18 '18 at 23:05
If Psidom's answer isn't good enough, it would help if you provided the expected result from this sample data. Because you are using random data, you'll need to revise your question with `set.seed` to make it reproducible. — r2evans, Jul 18 '18 at 23:07
If your variable is already a `factor`, you might be able to take advantage of `forcats::fct_infreq()`. — Nate, Jul 18 '18 at 23:11

score 1 · Accepted Answer · answered Jul 18 '18 at 23:04

Use `table` to count the items, then `which.max` to find the most frequent one (`names` extracts its value as a character string):

df %>%
    group_by(cat) %>%
    mutate(cat_mode = names(which.max(table(num)))) %>% 
    head()

# A tibble: 6 x 3
# Groups: cat [4]
#  cat      num cat_mode
#  <fctr> <dbl> <chr>   
#1 Q      305   138     
#2 W       34.0 212     
#3 R       53.0 53      
#4 D      395   5       
#5 W      212   212     
#6 Q      417   138  
# ...
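
One caveat with this approach: `names()` always returns a character string (hence the `<chr>` column above), and `table()` sorts the distinct values, so ties go to the first value in sorted (or level) order. A minimal sketch in base R of converting the result back to the original type, using made-up vectors `x` and `f` that are not part of the question's data:

```r
# Illustrative numeric vector: 1 and 3 both occur twice (a tie).
x <- c(3, 1, 3, 2, 1)

# table() counts the sorted distinct values; which.max() takes the first maximum.
tab_mode <- names(which.max(table(x)))   # character result
num_mode <- as.numeric(tab_mode)         # convert back to numeric

# Illustrative factor: rebuild the result as a factor with the original levels.
f <- factor(c("b", "a", "b", "c"))
f_mode <- factor(names(which.max(table(f))), levels = levels(f))

print(tab_mode)
print(num_mode)
print(f_mode)
```

Inside `mutate()`, the same conversion could be wrapped around the `names(which.max(table(...)))` call if a numeric or factor column is wanted.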

score 1 · Answer 2 · answered Jul 18 '18 at 23:08

This is similar to the question "Is there a built-in function for finding the mode?"

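# Return the most frequent value of x (the first one encountered wins ties)
Mode <- function(x) {
  ux <- unique(x)                         # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]   # count each value and pick the most frequent
}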

df %>% 
  group_by(cat) %>% 
  mutate(cat_mode = Mode(num))

# A tibble: 100 x 3
# Groups:   cat [26]
   cat     num cat_mode
   <fct> <dbl>    <dbl>
 1 S        25       25
 2 V        86      478
 3 R       335      335
 4 S       288       25
 5 S       330       25
 6 Q       384      384
 7 C       313      313
 8 H       275      275
 9 K       274      274
10 J        75       75
# ... with 90 more rows

To see the mode for each level of `cat`:

df %>% 
  group_by(cat) %>% 
  summarise(cat_mode = Mode(num))

# A tibble: 26 x 2
   cat   cat_mode
   <fct>    <dbl>
 1 A          480
 2 B          380
 3 C          313
 4 D          253
 5 E          202
 6 F           52
 7 G          182
 8 H          275
 9 I          356
10 J           75
# ... with 16 more rows

Thanks for this. Would you know how to modify the Mode formula to exclude one of the levels of the factor ("Unknown")? — Parseltongue, Jul 18 '18 at 23:21
You could filter this out before applying the mode function and the grouping? — Vivek Katial, Jul 18 '18 at 23:22
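
One way to follow the suggestion above without dropping rows is to exclude the level inside the function itself. The sketch below assumes the unwanted level is literally named "Unknown"; `mode_excluding` and the `status` vector are hypothetical names introduced only for illustration, and the counting logic is the same as in `Mode` above.

```r
# Hypothetical helper: most frequent value of x, ignoring NA and the
# values listed in `exclude` (assumed here to be the level "Unknown").
mode_excluding <- function(x, exclude = "Unknown") {
  kept <- x[!(x %in% exclude) & !is.na(x)]
  if (length(kept) == 0) {
    return(x[NA_integer_])               # nothing left: NA of the same type as x
  }
  ux <- unique(kept)                     # distinct remaining values
  ux[which.max(tabulate(match(kept, ux)))]
}

# Illustrative data: "Unknown" is the most frequent level overall.
status <- factor(c("Unknown", "Unknown", "Unknown", "Yes", "No", "Yes"))
print(mode_excluding(status))
```

Inside the grouped pipeline this could be called as `mutate(cat_mode = mode_excluding(some_factor))`, where `some_factor` stands in for the real column. Filtering with `filter(some_factor != "Unknown")` before `group_by()` also works, but it removes those rows from the result.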
