I am looking for the most frequent value (a character string) in each column, along with its frequency.

The intended result is a data frame with three columns:

char: the names of the original columns
mode: the most frequent value in each column
freq: the frequency of the modes

When several values tie for the highest frequency, I want to put all of them in one cell, separated by commas. Or is there a better representation?

Question: I don't know how to handle ties.

I have used the table() function to get the frequency tables of each column.

clean <- read.xlsx("test.xlsx", sheet = "clean") %>% as_tibble()
freqtb <- apply(clean, 2, table)

Here is the second table I got in freqtb:

$休12
个 休 天 饿 
1 33  2  1 

Then I looped through the tables:

freq <- vector()
mode <- vector()
for (tb in freqtb) {

    max = max(tb)
    name = names(tb)[tb==max]

    freq <- append(freq, max)
    mode <- append(mode, name)
}
results <- data.frame(char = names(freqtb), freq = freq, mode=mode)

The mode vector ends up longer than the other vectors, so it cannot be attached to results. I bet this is due to ties.

How can I get the same length for this "mode" variable?

Eureka

1 Answer

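You can make some small modifications to the code here to get a Mode function that returns every tied value. Then lapply over your data frame and rbind the results together. Ties produce one row per tied value, as column c shows in the output.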

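options(stringsAsFactors = FALSE)  # keep strings as character (the default since R 4.0)
set.seed(2)

# Example data: one character column and two integer columns
df.in <-
  data.frame(
    a = sample(letters[1:3], 10, TRUE),
    b = sample(1:3, 10, TRUE),
    c = rep(1:2, 5))

# Return every value that attains the maximum frequency, one row per value
Mode <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))   # count of each unique value
  ind <- which(tab == max(tab))   # positions of all tied maxima
  data.frame(char = ux[ind], freq = tab[ind])
}

# Apply Mode to each column and stack the per-column results
do.call(rbind, lapply(df.in, Mode))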
#     char freq
# a      c    4
# b      1    4
# c.1    1    5
# c.2    2    5
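
If you would rather have exactly one row per column, with tied values collapsed into a single comma-separated cell as described in the question, something like the following might work. This is only a sketch: the helper name `mode_row` and the small data frame `df` are made up for illustration, and it assumes every column has at least one non-missing value (`table()` drops `NA` by default).

```r
# Hypothetical helper: one row per column, ties collapsed into one string.
mode_row <- function(x) {
  tab <- table(x)                      # frequency of each distinct value
  top <- names(tab)[tab == max(tab)]   # every value that reaches the maximum
  data.frame(mode = paste(top, collapse = ", "),
             freq = max(tab),
             stringsAsFactors = FALSE)
}

# Made-up example data; column `c` has a tie between 1 and 2.
df <- data.frame(a = c("x", "x", "y", "z"),
                 c = c(1, 2, 1, 2),
                 stringsAsFactors = FALSE)

rows <- lapply(df, mode_row)
results <- data.frame(char = names(rows),
                      do.call(rbind, rows),
                      row.names = NULL)
results
```

Here column `c` ends up with mode "1, 2" and freq 2, so `char`, `mode` and `freq` all have the same length. The trade-off is that `mode` becomes a plain string, so the long format above (one row per tied value) is easier to work with if you need the individual values later.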
IceCreamToucan