
I want to calculate the most frequent value of a categorical variable. I tried using the mlv function in the modeest package, but I am getting NAs.

user <- c("A","B","A","A","B","A","B","B")
color <- c("blue","green","blue","blue","green","yellow","pink","blue")
df <- data.frame(user,color)
df$color <- as.factor(df$color)

library(plyr)
library(dplyr)
library(modeest)

summary <- ddply(df,.(user),summarise,mode=mlv(color,method="mlv")[['M']])

Warning messages:
1: In discrete(x, ...) : NAs introduced by coercion
2: In discrete(x, ...) : NAs introduced by coercion

summary
   user mode
1    A   NA
2    B   NA

Whereas I need this:

user  mode
A     blue
B     green

What am I doing wrong? I tried using other methods, as well as just mlv(x=color). According to the help pages of modeest, it should work for factors.

I don't want to use table(), as I need a simple function that I can use to create a summary table like in this question: "How to get the mode of a group in summarize in R", but for a categorical column.

cantordust
    Maybe also relevant: [*"Is there a built-in function for finding the mode?"*](https://stackoverflow.com/q/2547402/2204410) – Jaap Oct 20 '17 at 09:44

3 Answers


You should try table. For instance, which.max(table(color)).
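
As a small illustration of this suggestion, using the color vector from the question: which.max() on a table returns a named integer position rather than the label, so names() is needed to get the value itself. The variable names counts, top and mode_value are only illustrative.

```r
# Sample data from the question
color <- c("blue", "green", "blue", "blue", "green", "yellow", "pink", "blue")

counts <- table(color)        # frequency of each distinct value
top <- which.max(counts)      # named integer: position of the first maximum
mode_value <- names(top)      # label of the most frequent value

mode_value
# [1] "blue"
```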

loukdelouk
  • Thanks for your tip! This was just a simple example problem. I actually want to calculate the mode value for each unique value of another column in my huge database (~1 million samples, 10000 unique values), perhaps using dplyr::summarise.. Is there something else I could use to calculate the mode in a functional way? – cantordust Oct 23 '17 at 02:11
  • I have changed the example to be more reflective of my problem. – cantordust Oct 23 '17 at 02:25
  • @cantordust I added an answer using dplyr::summarise for your request. – Agile Bean Apr 22 '19 at 08:41

The reason modeest::mlv.factor() does not work might actually be a bug in the package.

In the function mlv.factor() the function modeest:::discrete() is called. In there, this is what happens:

f <- factor(color)
[1] blue   green  blue   blue   green  yellow pink   blue  
Levels: blue green pink yellow

tf <- tabulate(f)
[1] 4 2 1 1

as.numeric(levels(f)[tf == max(tf)])
[1] NA
Warning message:
NAs introduced by coercion 

This is what is returned to mlv.factor(). But levels(f)[tf == max(tf)] equals [1] "blue", which as.numeric() cannot convert to a number, so the result is NA.
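
To make the failure concrete, here is a minimal sketch that reproduces the steps quoted above outside the package. It only mimics what the answer describes and is not the package's actual source. The second factor, g, is a made-up example showing that the same conversion succeeds when the levels look like numbers.

```r
f <- factor(c("blue", "green", "blue", "blue", "green", "yellow", "pink", "blue"))
tf <- tabulate(f)                        # counts per level: 4 2 1 1
top_level <- levels(f)[tf == max(tf)]    # "blue", a character string

# Converting a non-numeric label to a number yields NA (normally with a warning)
as_num <- suppressWarnings(as.numeric(top_level))
is.na(as_num)
# [1] TRUE

# Hypothetical factor with numeric-looking levels: the conversion works here
g <- factor(c("3", "7", "3"))
tg <- tabulate(g)
num_mode <- as.numeric(levels(g)[tg == max(tg)])
num_mode
# [1] 3
```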

You can calculate the mode by finding the unique values and counting how many times each appears in the vector. You can then subset the unique values for the one that appears most often (i.e. the mode).

Find unique colours:

unique_colors <- unique(color)

match(color, unique_colors) returns, for each element of color, the position of its first match in unique_colors. tabulate() then counts how many times each position (and therefore each color) occurs. which.max() returns the index of the most frequently occurring value. This index can then be used to subset the unique colors.

unique_colors[which.max(tabulate(match(color, unique_colors)))]
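
Broken into its intermediate steps, the one-liner above can be followed like this. This is just a sketch with the question's data; positions, counts and winner are illustrative names. Here color is a plain character vector, so the result prints as a string; with a factor the levels are printed as well.

```r
color <- c("blue", "green", "blue", "blue", "green", "yellow", "pink", "blue")

unique_colors <- unique(color)             # "blue" "green" "yellow" "pink"
positions <- match(color, unique_colors)   # 1 2 1 1 2 3 4 1
counts <- tabulate(positions)              # 4 2 1 1
winner <- which.max(counts)                # 1
unique_colors[winner]
# [1] "blue"
```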

This is perhaps more readable with the pipe operator that dplyr provides:

library(dplyr)
unique(color)[color %>%
                match(unique(color)) %>% 
                tabulate() %>%
                which.max()]

Both options return:

[1] blue
Levels: blue green pink yellow

EDIT:

The best way is probably to create your own mode function:

calculate_mode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

and then use it in dplyr::summarise():

library(dplyr)

df %>% 
  group_by(user) %>% 
  summarise(color = calculate_mode(color))

Which returns:

# A tibble: 2 x 2
    user  color
  <fctr> <fctr>
1      A   blue
2      B  green
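
One thing worth knowing about this function is how it behaves on ties. The sketch below uses a made-up vector, tied, and compares it with the table() approach from the other answer: which.max() picks the first maximum, so calculate_mode() returns the tied value that appears first in the data, while table() orders its entries by sorted value.

```r
calculate_mode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

# Hypothetical input in which "a" and "b" both occur twice
tied <- c("b", "a", "a", "b")

calculate_mode(tied)             # the first value encountered wins
# [1] "b"

names(which.max(table(tied)))    # table() sorts its names, so "a" comes first
# [1] "a"
```

If a deterministic choice among tied values matters, it may be worth deciding explicitly which of the two behaviours is wanted.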
clemens
  • Thanks for your detailed explanation! I now understand why mlv doesn't work. However, the example I provided was just a sample problem. I actually want to calculate the mode value for each unique value of another column in my huge database (~1 million samples, 10000 unique values), perhaps using dplyr::summarise.. Is there something else I could use to calculate the mode in a functional way? – cantordust Oct 23 '17 at 02:12
  • I have changed the example to be more reflective of my problem. – cantordust Oct 23 '17 at 02:25
  • Thank you so much Clemens! Will create the function as you suggested. I wanted to avoid doing that, as I felt it may be inefficient, but seems like the only option for now. Strange that a statistical tool like R has no in-built mode calculator. – cantordust Oct 24 '17 at 05:53
  • Good to hear. You may consider accepting the answer to indicate your question was answered. – clemens Oct 24 '17 at 06:42

Solution with dplyr

You can use a more generalized version of the answer by @loukdelouk, like this:

df %>% 
  group_by(user) %>% 
  select_if(is.factor) %>% 
  summarise_all(function(x) { x %>% table %>% which.max %>% names })

Or, shorter:

df %>% 
  group_by(user) %>% 
  summarise_if(is.factor, .funs = function(x) { x %>% table %>% which.max %>% names})
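
In newer versions of dplyr the scoped verbs such as summarise_if() are superseded by across(). A rough equivalent, assuming dplyr 1.0.0 or later is installed, could look like the sketch below; result is an illustrative name and the data is rebuilt from the question so the block runs on its own.

```r
library(dplyr)

# Data from the question, with color stored as a factor
user <- c("A", "B", "A", "A", "B", "A", "B", "B")
color <- c("blue", "green", "blue", "blue", "green", "yellow", "pink", "blue")
df <- data.frame(user, color = factor(color))

# For every factor column, take the label with the highest count per group
result <- df %>%
  group_by(user) %>%
  summarise(across(where(is.factor), ~ names(which.max(table(.x)))))

result$color
# [1] "blue"  "green"
```

Note that names() returns a character string, so the summarised column is character rather than factor.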
Agile Bean