Rowwise find most frequent term in dataframe column and count occurrences

Question

I try to find the most frequent category within every row of a dataframe. A category can consist of multiple words split by a /.

library(tidyverse)
library(DescTools)

# example data
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
                "apple,NA,apple,chocolate", 
                "shoes/socks,NA,NA,NA",
                "apple,apple,chocolate,chocolate")

df <- data.frame(id, categories)

# the solution I would like to achieve
solution <- df %>% 
  mutate(winner = c("apple", "apple", "shoes/socks", "apple"),
         winner_count = c(1, 2, 1, 2))

Based on these answers I have tried the following: Write a function that finds the most common word in a string of text using R

trial <- df %>% 
  rowwise() %>% 
  mutate(winner = names(which.max(table(categories %>% str_split(",")))),
         winner_count = which.max(table(categories %>% str_split(",")))[[1]])

Also tried to follow this approach, however it also does not give me the required results How to find the most repeated word in a vector with R

trial2 <- df %>% 
  mutate(winner = DescTools::Mode(str_split(categories, ","), na.rm = T))

I am mainly struggling because my most frequent category is not just one word but something like "shoes/socks" and the fact that I also have NAs. I don't want the NAs to be the "winner".
I don't care too much about the ties right now. I already have a follow up process in place where I handle the cases that have winner_count = 2.

Ronak Shah · Accepted Answer · 2021-04-19T09:50:46.593

split the categories on comma in separate rows, count their occurrence for each id, drop the NA values and select the top occurring row for each id

library(dplyr)
library(tidyr)

df %>%
  separate_rows(categories, sep = ',') %>%
  count(id, categories, name = 'winner_count') %>%
  filter(categories != 'NA') %>%
  group_by(id) %>%
  slice_max(winner_count, n = 1, with_ties = FALSE) %>%
  ungroup %>%
  rename(winner = categories) %>%
  left_join(df, by = 'id') -> result

result

#    id winner      winner_count categories                                
#  <dbl> <chr>              <int> <chr>                                     
#1     1 apple                  1 apple,shoes/socks,trousers/jeans,chocolate
#2     2 apple                  2 apple,NA,apple,chocolate                  
#3     3 shoes/socks            1 shoes/socks,NA,NA,NA                      
#4     4 apple                  2 apple,apple,chocolate,chocolate

Rowwise find most frequent term in dataframe column and count occurrences

1 Answers1