
I have a vector, say

c(1,1,1,1,1,1,2,3,4,5,7,7,5,7,7,7)

How do I count the occurrences of each element and then return, say, the 3 most common elements, i.e. 1, 7, and 5?

Henrik
You can use table(); see http://stackoverflow.com/questions/1923273/counting-the-number-of-elements-with-the-values-of-x-in-a-vector – cobie Jun 28 '13 at 22:38

5 Answers


I'm sure this is a duplicate, but the answer is simple:

sort(table(variable), decreasing=TRUE)[1:3]
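
Applied to the question's vector, as a quick illustration (the x below stands in for variable):

> x <- c(1,1,1,1,1,1,2,3,4,5,7,7,5,7,7,7)
> sort(table(x), decreasing=TRUE)[1:3]
x
1 7 5 
6 5 2 
> names(sort(table(x), decreasing=TRUE)[1:3])
[1] "1" "7" "5"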
Thomas

I don't know if this is better than the table approach, but if your vector is already a factor, its summary method will give you frequency counts:

> summary(as.factor(c(1,1,1,1,1,1,2,3,4,5,7,7,5,7,7,7)))
1 2 3 4 5 7 
6 1 1 1 2 5 

And then you can get the top 3 most frequent like so:

> names(sort(summary(as.factor(c(1,1,1,1,1,1,2,3,4,5,7,7,5,7,7,7))), decreasing=TRUE)[1:3])
[1] "1" "7" "5"
qwwqwwq

If your vector contains only integers, tabulate will be much faster than anything else. There are a couple of catches to be aware of:

  • By default, it returns counts for the integers 1 through max(x).
  • It returns an unnamed vector.

That means if x = c(1,1,1,3), then tabulate(x) returns c(3, 0, 1): one count for each integer from 1 to max(x).
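
For example:

> tabulate(c(1,1,1,3))
[1] 3 0 1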

How can you use tabulate so that it also handles zeros and negative numbers?

set.seed(45)
x <- sample(-5:5, 25, TRUE)
#  [1]  1 -2 -3 -1 -2 -2 -3  1 -3 -5 -1  4 -2  0 -1 -1  5 -4 -1 -3 -4 -2  1  2  4

Just shift x by 1 - min(x) so that the smallest value becomes 1 (when min(x) <= 0 this is the same as adding abs(min(x)) + 1); the shift also keeps the counts aligned with the names when min(x) > 1:

sort(setNames(tabulate(x - min(x) + 1L),
      seq(min(x), max(x))), decreasing=TRUE)[1:3]
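
Wrapped as a small reusable function (the name top_n_int is only illustrative, not part of the original answer):

top_n_int <- function(x, n=3) {
  # shift so the smallest value maps to 1, then label counts with the original values
  counts <- setNames(tabulate(x - min(x) + 1L), seq(min(x), max(x)))
  sort(counts, decreasing=TRUE)[seq_len(n)]
}

top_n_int(x)  # with the seeded x above
# -2 -1 -3 
#  5  5  4    (-2 and -1 tie at 5, so their order may vary)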

If your vector contains NA, you can use table with the useNA="always" argument.
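
For instance:

> table(c(1, 2, 2, NA), useNA="always")

   1    2 <NA> 
   1    2    1 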

Arun

You can use the table() function to tabulate the frequency of values in a vector, and then sort the resulting table.

x = c(1, 1, 1, 2, 2)
sort(table(x))
x
2 1 
2 3 
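
To put the most frequent value first, sort in decreasing order instead:

sort(table(x), decreasing=TRUE)
x
1 2 
3 2 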
cobie

I have gathered a few answers to this question from different threads and run a microbenchmark comparison (on a Windows computing server running R 4.3.0, with 512 GB RAM and two 24-core AMD processors).

I compare 4 methods, based on dplyr, data.table, base R, and Rfast:

library(dplyr)

# dplyr based function
mostfreqval1 <- function(x,k=1){
  tibble(v=x) %>% count(v) %>% arrange(desc(n)) %>% slice(1:k) %>% pull(v)
}

# using data.table 
mostfreqval2 <- function(x,k=1){
  require(data.table)
  ds <- data.table(x)
  setkey(ds,x)
  sorted <- ds[,.N,by=list(x)]
  return(sorted[order(-N)]$x[1:k])
}

# using Base R
mostfreqval3 <- function(x,k=1){
  x %>% table() %>% sort(decreasing=TRUE) %>% names() %>% head(k) 
}

# Base R boosted by Rfast
mostfreqval4 <- function(x,k=1){
  x %>% Rfast::Table() %>% sort(decreasing=TRUE) %>% names() %>% head(k)
}

set.seed(123)
myvec <- sample(letters[1:10], 1e6, replace=TRUE)

microbenchmark::microbenchmark(myvec %>% mostfreqval1(k=3))
microbenchmark::microbenchmark(myvec %>% mostfreqval2(k=3))
microbenchmark::microbenchmark(myvec %>% mostfreqval3(k=3))
microbenchmark::microbenchmark(myvec %>% mostfreqval4(k=3))
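
The four calls can also be timed in a single microbenchmark() call, which yields one combined table (a minor variation on the above; the labels are my own):

microbenchmark::microbenchmark(
  dplyr      = mostfreqval1(myvec, k=3),
  data.table = mostfreqval2(myvec, k=3),
  base       = mostfreqval3(myvec, k=3),
  Rfast      = mostfreqval4(myvec, k=3),
  times = 100
)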

All functions return the same output: a vector c("d","j","a").

In terms of speed, the results are the following:

> microbenchmark::microbenchmark(myvec %>% mostfreqval1(k=3))
Unit: milliseconds
                          expr     min      lq     mean   median      uq     max neval
 myvec %>% mostfreqval1(k = 3) 22.2394 23.5893 24.00849 24.08615 24.4052 25.6265   100
> microbenchmark::microbenchmark(myvec %>% mostfreqval2(k=3))
Unit: milliseconds
                          expr     min       lq     mean  median      uq     max neval
 myvec %>% mostfreqval2(k = 3) 23.7754 24.44535 24.84656 24.7828 25.1395 26.6308   100
> microbenchmark::microbenchmark(myvec %>% mostfreqval3(k=3))
Unit: milliseconds
                          expr     min      lq     mean   median       uq     max neval
 myvec %>% mostfreqval3(k = 3) 41.9721 42.3926 43.45681 42.62645 43.23245 49.9802   100
> microbenchmark::microbenchmark(myvec %>% mostfreqval4(k=3))
Unit: milliseconds
                          expr     min       lq     mean  median      uq     max neval
 myvec %>% mostfreqval4(k = 3) 19.0955 19.12415 19.19986 19.1526 19.2288 19.9925   100

With these specifications, the dplyr-based, data.table-based, and Rfast-based solutions are comparable in speed (with Rfast slightly ahead), while the base R solution is clearly the slowest. I have noticed that dplyr and data.table do better on smaller problems (e.g., vectors shorter than 1e5 elements), but Rfast catches up at larger vector sizes.

Note that the ranking is similar when the comparison is run on a MacBook Pro, except that the data.table solution becomes significantly slower, probably due to the lack of OpenMP support on macOS.

Happy to add any other function to this comparison, in case you find it useful.

Roland