11

I am trying to find the most frequent value by group. In the following example dataframe:

df<-data.frame(a=c(1,1,1,1,2,2,2,3,3),b=c(2,2,1,2,3,3,1,1,2))  
> df  
  a b  
1 1 2  
2 1 2  
3 1 1  
4 1 2  
5 2 3  
6 2 3  
7 2 1  
8 3 1  
9 3 2  

I would like to add a column 'c' which has the most occurring value in 'b' when its values are grouped by 'a'. I would like the following output:

> df  
  a b c  
1 1 2 2    
2 1 2 2    
3 1 1 2    
4 1 2 2    
5 2 3 3    
6 2 3 3    
7 2 1 3    
8 3 1 1   
9 3 2 1    

I tried using table and tapply but didn't get it right. Is there a fast way to do that?
Thanks!

smci
  • 32,567
  • 20
  • 113
  • 146
Asif Shakeel
  • 125
  • 1
  • 1
  • 4
  • 3
    This is very related to [this](http://stackoverflow.com/questions/2547402/standard-library-function-in-r-for-finding-the-mode) – David Arenburg Mar 25 '15 at 12:25

3 Answers3

16

Building on Davids comments your solution is the following:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

library(dplyr)
df %>% group_by(a) %>% mutate(c=Mode(b))

Notice though that for the tie when df$a is 3 then the mode for b is 1.

dimitris_ps
  • 5,849
  • 3
  • 29
  • 55
8

We could get the 'Mode' of 'b' grouped by 'a' using ave

 Mode <- function(x) {
 ux <- unique(x)
 ux[which.max(tabulate(match(x, ux)))]
}

df$c <-  with(df, ave(b, a, FUN=Mode))
df$c
#[1] 2 2 2 2 3 3 3 1 1

Or using data.table

library(data.table)
setDT(df)[, c:= Mode(b), by=a][]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thank you @akrun! To exclude NAs from the mode function, change the second line to 'ux <- unique(x[!is.na(x)])'. – WalterB May 05 '20 at 16:57
1

Here is a base R method that uses table to calculate a cross tab, max.col to find the mode per group, and rep together with rle to fill in the mode across groups.

# calculate a cross tab, frequencies by group
myTab <- table(df$a, df$b)
# repeat the mode for each group, as calculated by colnames(myTab)[max.col(myTab)] 
# repeating by the number of times the group ID is observed
df$c <- rep(colnames(myTab)[max.col(myTab)], rle(df$a)$length)

df
  a b c
1 1 2 2
2 1 2 2
3 1 1 2
4 1 2 2
5 2 3 3
6 2 3 3
7 2 1 3
8 3 1 2
9 3 2 2

Note that this assumes the data has been sorted by group. Also, the default of max.col is to break ties (mulitple modes) at random. If you want the first or last value to be the mode, you can set this using the ties.method argument.

lmo
  • 37,904
  • 9
  • 56
  • 69