17

I am trying to retrieve the most repeated value in a particular column of a data frame. Here is my sample data and code below.

data("Forbes2000", package = "HSAUR")
head(Forbes2000)


  rank                name        country             category  sales profits  assets marketvalue
1    1           Citigroup  United States              Banking  94.71   17.85 1264.03      255.30
2    2    General Electric  United States        Conglomerates 134.19   15.59  626.93      328.54
3    3 American Intl Group  United States            Insurance  76.66    6.46  647.66      194.87
4    4          ExxonMobil  United States Oil & gas operations 222.88   20.96  166.99      277.02
5    5                  BP United Kingdom Oil & gas operations 232.57   10.27  177.57      173.54
6    6     Bank of America  United States              Banking  49.01   10.81  736.45      117.55

As per my sample data, I need to return the most repeated category, which for the Bermuda subset below is Insurance.

subset(Forbes2000, country == "Bermuda")
mnel
  • 113,303
  • 27
  • 265
  • 254
Teja
  • 13,214
  • 36
  • 93
  • 155
  • How about `sort(table(yourdata$category), decreasing=TRUE)[1]`. There are lots of other ways too! – Justin Aug 29 '12 at 22:09
  • I need to return the most repeated value from my data... – Teja Aug 29 '12 at 22:12
  • 2
    I thought I'd leave that to the reader as an exercise. `names(sort(table(yourdata$category), decreasing=TRUE)[1])`. But Josh makes a good point below, what if you've got a tie! – Justin Aug 29 '12 at 22:24

9 Answers

21
tail(names(sort(table(Forbes2000$category))), 1)
ALiX
  • 1,021
  • 5
  • 9
12

In case two or more categories may be tied for most frequent, use something like this:

x <- c("Insurance", "Insurance", "Capital Goods", "Food markets", "Food markets")
tt <- table(x)
names(tt[tt==max(tt)])
[1] "Food markets" "Insurance" 
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
5

Another way with the data.table package, which is faster for large data sets:

set.seed(1)
x=sample(seq(1,100), 5000000, replace = TRUE)

method 1 (solution proposed above)

start.time <- Sys.time()
tt <- table(x)
names(tt[tt==max(tt)])
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Time difference of 4.883488 secs

method 2 (DATA TABLE)

start.time <- Sys.time()
library(data.table)   # data.table() and setkey() come from this package
ds <- data.table(x)
setkey(ds, x)
sorted <- ds[,.N,by=list(x)]

most_repeated_value <- sorted[order(-N)]$x[1]
most_repeated_value

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Time difference of 0.328033 secs

Timothée HENRY
  • 14,294
  • 21
  • 96
  • 136
  • 7
    tucson, nice. I think `as.data.table(ds)[, .N, by=x][, x[N == max(N)]]` also does the job, which takes 0.06s on my laptop. As a FYI, no need to `setkey` for aggregations. – Arun Jul 20 '14 at 07:50
  • @Arun Thank you. Your solution should be on top of this page. – Timothée HENRY Jul 20 '14 at 08:55
1

I know my answer is coming a little late, but I built the following function, which does the job in less than a second for my data frame of more than 50,000 rows:

print_count_of_unique_values <- function(df, column_name, remove_items_with_freq_equal_or_lower_than = 0, return_df = F, 
                                         sort_desc = T, return_most_frequent_value = F)
{
  temp <- df[column_name]
  output <- as.data.frame(table(temp))
  names(output) <- c("Item","Frequency")
  output_df <- output[  output[[2]] > remove_items_with_freq_equal_or_lower_than,  ]

  if (sort_desc){
    output_df <- output_df[order(output_df[[2]], decreasing = T), ]
  }

  cat("\nThis is the (head) count of the unique values in dataframe column '", column_name,"':\n")
  print(head(output_df))

  if (return_df){
    return(output_df)
  }

  if (return_most_frequent_value){
      output_df$Item <- as.character(output_df$Item)
      output_df$Frequency <- as.numeric(output_df$Frequency)
      most_freq_item <- output_df[1, "Item"]
      cat("\nReturning most frequent item: ", most_freq_item)
      return(most_freq_item)
  }
}

So if you have a data frame called "df" with a column called "name", and you want to know the most common value in the "name" column, you could run:

most_common_name <- print_count_of_unique_values(df=df, column_name = "name", return_most_frequent_value = T)    
Angelo
  • 1,594
  • 5
  • 17
  • 50
1

You can create a function:

get_mode <- function(x){
  return(names(sort(table(x), decreasing = T, na.last = T)[1]))
}

and then do

get_mode(Forbes2000$category)

The reason I created a function is that I have to do this kind of thing very often.

Malvika
  • 61
  • 6
1

The following is the easiest (for me) to read and to remember:

names(which.max(table(Forbes2000$category)))

Extra notes on efficiency: This approach avoids sorting the table entries (finding the max is cheaper than a full sort). The most efficient solution would avoid a full tabulation. You can imagine an Rcpp solution that loops through the source vector and keeps a running tabulation but stops before the end, when the contest is already over. If anyone writes that solution, ping me so I can give you a +1 and edit this answer to reference your answer.
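As a rough illustration of that idea (not the Rcpp version this answer has in mind, just a plain-R sketch with a made-up name `early_mode`; the interpreted loop makes it slow in practice):

# Plain-R sketch of the early-stopping mode finder described above.
# It keeps a running tabulation and quits once no other value can
# catch the current leader in the elements that remain.
early_mode <- function(x) {
  counts <- integer(0)                       # running counts, named by value
  n <- length(x)
  for (i in seq_along(x)) {
    key <- as.character(x[i])
    counts[key] <- if (is.na(counts[key])) 1L else counts[key] + 1L
    srt <- sort(counts, decreasing = TRUE)
    lead <- srt[1]
    second <- if (length(srt) > 1) srt[2] else 0L
    if (lead - second > n - i) break         # contest already decided
  }
  names(counts)[which.max(counts)]
}

early_mode(Forbes2000$category)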

scottkosty
  • 2,410
  • 1
  • 16
  • 22
1

Using the function approach from @Malvika makes it easy to apply across a table and get these values for every column:

#create a mode function
get_mode_name <- function(x){
  return(names(sort(table(x), decreasing = T, na.last = T)[1]))
}

get_mode_value <- function(x){
  return(unname(sort(table(x), decreasing = T, na.last = T)[1]))
}

get_mode_pct<- function(x){
  return(unname(sort(table(x), decreasing = T, na.last = T)[1])/length(x))
}

#Identify character columns
type_table <- sapply(table_name, class)

#create vector numeric and character types
num_table <- (unname(type_table) == "numeric")
char_table <- (unname(type_table) == "character")

#View the modes of character columns
mode_name <- apply(table_name[,char_table], 2, function(x) get_mode_name(x))    
mode_value <- apply(table_name[,char_table], 2, function(x) get_mode_value(x))
mode_pct <- apply(table_name[,char_table], 2, function(x) get_mode_pct(x))
Tyler Knight
  • 183
  • 1
  • 9
0

You can use table(Forbes2000$category, useNA="ifany"). This will list every value that occurs in the chosen column and the number of times each one appears in the data frame.
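For example, combining that count table with sort() or which.max() (as in the earlier answers) yields the most frequent category directly:

counts <- table(Forbes2000$category, useNA = "ifany")   # counts per category, NA included
sort(counts, decreasing = TRUE)[1]                      # top category with its count
names(which.max(counts))                                # just the name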

lorem monkey
  • 3,942
  • 3
  • 35
  • 49
0

I suggest Rfast::Table.

Rfast::Table(as.character(Forbes2000$category))

Then you can get the maximum value.
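For instance (a sketch that assumes Table() returns a named count vector like base table() does):

library(Rfast)
tab <- Table(as.character(Forbes2000$category))  # named frequency vector (assumed)
names(tab)[which.max(tab)]                       # most frequent category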

Manos Papadakis
  • 564
  • 5
  • 17