17

I am trying to retrieve the most repeated value in a particular column of a data frame. Here is my sample data and code below.

data("Forbes2000", package = "HSAUR")
head(Forbes2000)


  rank                name        country             category  sales profits  assets marketvalue
1    1           Citigroup  United States              Banking  94.71   17.85 1264.03      255.30
2    2    General Electric  United States        Conglomerates 134.19   15.59  626.93      328.54
3    3 American Intl Group  United States            Insurance  76.66    6.46  647.66      194.87
4    4          ExxonMobil  United States Oil & gas operations 222.88   20.96  166.99      277.02
5    5                  BP United Kingdom Oil & gas operations 232.57   10.27  177.57      173.54
6    6     Bank of America  United States              Banking  49.01   10.81  736.45      117.55

As per my sample data, I need to return the most repeated category, which for the Bermuda subset below is Insurance.

subset(Forbes2000, country == "Bermuda")
mnel
  • 113,303
  • 27
  • 265
  • 254
Teja
  • 13,214
  • 36
  • 93
  • 155
  • How about `sort(table(yourdata$category), decreasing=TRUE)[1]`. There are lots of other ways too! – Justin Aug 29 '12 at 22:09
  • I need to return the most repeated value from my data... – Teja Aug 29 '12 at 22:12
  • 2
    I thought I'd leave that to the reader as an exercise. `names(sort(table(yourdata$category), decreasing=TRUE)[1])`. But Josh makes a good point below, what if you've got a tie! – Justin Aug 29 '12 at 22:24

9 Answers

21
tail(names(sort(table(Forbes2000$category))), 1)
ALiX
  • 1,021
  • 5
  • 9
12

In case two or more categories may be tied for most frequent, use something like this:

x <- c("Insurance", "Insurance", "Capital Goods", "Food markets", "Food markets")
tt <- table(x)
names(tt[tt==max(tt)])
[1] "Food markets" "Insurance" 
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
5

Another way with the data.table package, which is faster for large data sets:

set.seed(1)
x=sample(seq(1,100), 5000000, replace = TRUE)

method 1 (solution proposed above)

start.time <- Sys.time()
tt <- table(x)
names(tt[tt==max(tt)])
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Time difference of 4.883488 secs

method 2 (DATA TABLE)

start.time <- Sys.time()
library(data.table)   # data.table() and setkey() come from this package
ds <- data.table(x)
setkey(ds, x)
sorted <- ds[,.N,by=list(x)]

most_repeated_value <- sorted[order(-N)]$x[1]
most_repeated_value

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Time difference of 0.328033 secs

Timothée HENRY
  • 14,294
  • 21
  • 96
  • 136
  • 7
    tucson, nice. I think `as.data.table(ds)[, .N, by=x][, x[N == max(N)]]` also does the job, which takes 0.06s on my laptop. As a FYI, no need to `setkey` for aggregations. – Arun Jul 20 '14 at 07:50
  • @Arun Thank you. Your solution should be on top of this page. – Timothée HENRY Jul 20 '14 at 08:55
1

I know my answer is coming a little late, but I built the following function, which does the job in less than a second for my data frame of more than 50,000 rows:

print_count_of_unique_values <- function(df, column_name, remove_items_with_freq_equal_or_lower_than = 0, return_df = F, 
                                         sort_desc = T, return_most_frequent_value = F)
{
  temp <- df[column_name]
  output <- as.data.frame(table(temp))
  names(output) <- c("Item","Frequency")
  output_df <- output[  output[[2]] > remove_items_with_freq_equal_or_lower_than,  ]

  if (sort_desc){
    output_df <- output_df[order(output_df[[2]], decreasing = T), ]
  }

  cat("\nThis is the (head) count of the unique values in dataframe column '", column_name,"':\n")
  print(head(output_df))

  if (return_df){
    return(output_df)
  }

  if (return_most_frequent_value){
      output_df$Item <- as.character(output_df$Item)
      output_df$Frequency <- as.numeric(output_df$Frequency)
      most_freq_item <- output_df[1, "Item"]
      cat("\nReturning most frequent item: ", most_freq_item)
      return(most_freq_item)
  }
}

So if you have a data frame called "df" with a column called "name", and you want to know the most common value in the "name" column, you could run:

most_common_name <- print_count_of_unique_values(df=df, column_name = "name", return_most_frequent_value = T)    
Angelo
  • 1,594
  • 5
  • 17
  • 50
1

You can create a function:

get_mode <- function(x){
  return(names(sort(table(x), decreasing = T, na.last = T)[1]))
}

and then do

get_mode(Forbes2000$category)

The reason I created a function is that I have to do this kind of thing very often.

Malvika
  • 61
  • 6
1

The following is the easiest (for me) to read and to remember:

names(which.max(table(Forbes2000$category)))

Extra notes on efficiency: This approach avoids sorting the table entries (finding the max is cheaper than a full sort). The most efficient solution would avoid a full tabulation. You can imagine an Rcpp solution that loops through the source vector and keeps a running tabulation but stops before the end, when the contest is already over. If anyone writes that solution, ping me so I can give you a +1 and edit this answer to reference your answer.
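As a rough illustration of that idea (not the Rcpp version this answer has in mind, just a plain-R sketch with a made-up name `early_mode`; the interpreted loop makes it slow in practice):

# Plain-R sketch of the early-stopping mode finder described above.
# It keeps a running tabulation and quits once no other value can
# catch the current leader in the elements that remain.
early_mode <- function(x) {
  counts <- integer(0)                       # running counts, named by value
  n <- length(x)
  for (i in seq_along(x)) {
    key <- as.character(x[i])
    counts[key] <- if (is.na(counts[key])) 1L else counts[key] + 1L
    srt <- sort(counts, decreasing = TRUE)
    lead <- srt[1]
    second <- if (length(srt) > 1) srt[2] else 0L
    if (lead - second > n - i) break         # contest already decided
  }
  names(counts)[which.max(counts)]
}

early_mode(Forbes2000$category)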

scottkosty
  • 2,410
  • 1
  • 16
  • 22
1

Using the function approach from @Malvika makes it easy to apply across a table and get these values for every column:

#create a mode function
get_mode_name <- function(x){
  return(names(sort(table(x), decreasing = T, na.last = T)[1]))
}

get_mode_value <- function(x){
  return(unname(sort(table(x), decreasing = T, na.last = T)[1]))
}

get_mode_pct<- function(x){
  return(unname(sort(table(x), decreasing = T, na.last = T)[1])/length(x))
}

#Identify character columns
type_table <- sapply(table_name, class)

#create vector numeric and character types
num_table <- (unname(type_table) == "numeric")
char_table <- (unname(type_table) == "character")

#View the modes of character columns
mode_name <- apply(table_name[,char_table], 2, function(x) get_mode_name(x))    
mode_value <- apply(table_name[,char_table], 2, function(x) get_mode_value(x))
mode_pct <- apply(table_name[,char_table], 2, function(x) get_mode_pct(x))
Tyler Knight
  • 183
  • 1
  • 9
0

You can use table(Forbes2000$category, useNA="ifany"). This will list every value that occurs in the chosen column and the number of times each one appears in the data frame.
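For example, combining that count table with sort() or which.max() (as in the earlier answers) yields the most frequent category directly:

counts <- table(Forbes2000$category, useNA = "ifany")   # counts per category, NA included
sort(counts, decreasing = TRUE)[1]                      # top category with its count
names(which.max(counts))                                # just the name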

lorem monkey
  • 3,942
  • 3
  • 35
  • 49
0

I suggest Rfast::Table.

Rfast::Table(as.character(Forbes2000$category))

Then you can get the maximum value.
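For instance (a sketch that assumes Table() returns a named count vector like base table() does):

library(Rfast)
tab <- Table(as.character(Forbes2000$category))  # named frequency vector (assumed)
names(tab)[which.max(tab)]                       # most frequent category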

Manos Papadakis
  • 564
  • 5
  • 17