0

I try to find the most frequent category within every row of a dataframe. A category can consist of multiple words split by a /.

library(tidyverse)
library(DescTools)

# example data
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
                "apple,NA,apple,chocolate", 
                "shoes/socks,NA,NA,NA",
                "apple,apple,chocolate,chocolate")

df <- data.frame(id, categories)

# the solution I would like to achieve
solution <- df %>% 
  mutate(winner = c("apple", "apple", "shoes/socks", "apple"),
         winner_count = c(1, 2, 1, 2))

Based on these answers I have tried the following: Write a function that finds the most common word in a string of text using R

trial <- df %>% 
  rowwise() %>% 
  mutate(winner = names(which.max(table(categories %>% str_split(",")))),
         winner_count = which.max(table(categories %>% str_split(",")))[[1]])

Also tried to follow this approach, however it also does not give me the required results How to find the most repeated word in a vector with R

trial2 <- df %>% 
  mutate(winner = DescTools::Mode(str_split(categories, ","), na.rm = T))

I am mainly struggling because my most frequent category is not just one word but something like "shoes/socks" and the fact that I also have NAs. I don't want the NAs to be the "winner".
I don't care too much about the ties right now. I already have a follow up process in place where I handle the cases that have winner_count = 2.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214

1 Answers1

0

split the categories on comma in separate rows, count their occurrence for each id, drop the NA values and select the top occurring row for each id

library(dplyr)
library(tidyr)

df %>%
  separate_rows(categories, sep = ',') %>%
  count(id, categories, name = 'winner_count') %>%
  filter(categories != 'NA') %>%
  group_by(id) %>%
  slice_max(winner_count, n = 1, with_ties = FALSE) %>%
  ungroup %>%
  rename(winner = categories) %>%
  left_join(df, by = 'id') -> result

result

#    id winner      winner_count categories                                
#  <dbl> <chr>              <int> <chr>                                     
#1     1 apple                  1 apple,shoes/socks,trousers/jeans,chocolate
#2     2 apple                  2 apple,NA,apple,chocolate                  
#3     3 shoes/socks            1 shoes/socks,NA,NA,NA                      
#4     4 apple                  2 apple,apple,chocolate,chocolate     
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213