181

Suppose I have the following NumPy array:

import numpy as np

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])

How can I find the most frequent number in this array?

JustInTime
  • For Python lists see [Find the most common element in a list](https://stackoverflow.com/q/1518522/7851470) and [Find the item with maximum occurrences in a list](https://stackoverflow.com/q/6987285/7851470). – Georgy Nov 22 '20 at 11:56

14 Answers

247

If your array contains only non-negative ints, you should take a look at numpy.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

and then probably use np.argmax:

import numpy as np

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
counts = np.bincount(a)
print(np.argmax(counts))

For a more complicated list (one that perhaps contains negative numbers or non-integer values), you can use np.histogram in a similar way. Alternatively, if you just want to work in Python without NumPy, collections.Counter is a good way of handling this sort of data.

from collections import Counter
a = [1,2,3,1,2,1,1,1,3,2,2,1]
b = Counter(a)
print(b.most_common(1))
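
If the values are integers that may be negative, a simple workaround (my own sketch, not part of the original answer) is to shift them into the non-negative range before calling np.bincount:

import numpy as np

a = np.array([1, -2, 3, 1, -2, 1])  # hypothetical data with negatives
offset = a.min()                    # shift so the smallest value maps to bin 0
counts = np.bincount(a - offset)
print(np.argmax(counts) + offset)   # prints 1, the most frequent value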
JoshAdel
  • To those of us visiting after 2016: I dislike this answer, as bincount(arr) returns an array as large as the largest element in arr, so a small array with a large range would create an excessively large array. Apogentus's answer below is much better, although I don't think numpy.unique() existed in 2011, when this answer was created. – Wehrdo Mar 13 '16 at 22:03
  • @Nikolai yeah, but chaining is both less readable and less debuggable than having an intermediate `counts` variable, as @JoshAdel wrote – william_grisaitis Jul 13 '16 at 23:54
  • **Python 3**: `Counter(array).most_common(1)[0][0]` – diralik Mar 20 '18 at 19:39
  • @Wehrdo There is a trade-off. `bincount` consumes more memory but is much faster. When the range is increased, `bincount` takes more space but finishes in nearly the same amount of time, whereas `unique` needs much more time but takes the same amount of space. – W. Zhu Jan 23 '19 at 03:23
  • np.bincount assumes all elements to be >= 0, so this won't work if there are negative elements in the array – rohaldb Feb 11 '21 at 12:56
  • Choose the best tool for the job. `bincount` is not universally applicable, but it is really good when the values are positive integers with a small max bound, such as in the case of counting votes. – Fanchen Bao May 12 '23 at 21:15
134

You may use

import numpy as np

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
values, counts = np.unique(a, return_counts=True)

ind = np.argmax(counts)
print(values[ind])  # prints the most frequent element

ind = np.argpartition(-counts, kth=10)[:10]  # requires more than 10 distinct values
print(values[ind])  # prints the 10 most frequent elements (in no particular order)

If several elements are equally frequent, this code returns only the first of them.
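
If you also want the top elements ordered by frequency (argpartition leaves them unsorted), a small follow-up sketch, assuming `values` and `counts` from above:

order = np.argsort(-counts)[:10]  # indices of the 10 largest counts, descending
print(values[order])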

Apogentus
    I find this the most helpful as it is generic, short and allows pulling elements from values or counts by some derived index. – ryanjdillon Sep 22 '15 at 13:16
    If we have multiple most frequent values, `values[counts.argmax()]` will return the first value. To get all of them, we can use `values[counts == counts.max()]`. – W. Zhu Jan 23 '19 at 03:43
53

If you're willing to use SciPy:

>>> from scipy.stats import mode
>>> mode([1,2,3,1,2,1,1,1,3,2,2,1])
(array([ 1.]), array([ 6.]))
>>> most_frequent = mode([1,2,3,1,2,1,1,1,3,2,2,1])[0][0]
>>> most_frequent
1.0
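
Note: in recent SciPy versions, scipy.stats.mode returns a ModeResult named tuple rather than a plain tuple (and, if I recall correctly, scalars instead of length-1 arrays for 1-D input since SciPy 1.11), so attribute access is the cleaner spelling:

from scipy.stats import mode

result = mode([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
print(result.mode, result.count)  # most frequent value and its count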
Fred Foo
36

Performance (using IPython) of some of the solutions found here:

>>> # small array
>>> a = [12,3,65,33,12,3,123,888000]
>>> 
>>> import collections
>>> collections.Counter(a).most_common()[0][0]
3
>>> %timeit collections.Counter(a).most_common()[0][0]
100000 loops, best of 3: 11.3 µs per loop
>>> 
>>> import numpy
>>> numpy.bincount(a).argmax()
3
>>> %timeit numpy.bincount(a).argmax()
100 loops, best of 3: 2.84 ms per loop
>>> 
>>> import scipy.stats
>>> scipy.stats.mode(a)[0][0]
3.0
>>> %timeit scipy.stats.mode(a)[0][0]
10000 loops, best of 3: 172 µs per loop
>>> 
>>> from collections import defaultdict
>>> def jjc(l):
...     d = defaultdict(int)
...     for i in l:
...         d[i] += 1
...     return sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
... 
>>> jjc(a)[0]
3
>>> %timeit jjc(a)[0]
100000 loops, best of 3: 5.58 µs per loop
>>> 
>>> max(map(lambda val: (a.count(val), val), set(a)))[1]
12
>>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
100000 loops, best of 3: 4.11 µs per loop
>>> 

For small arrays like the one in the problem, 'max' with 'set' is the fastest. (Note it returned 12 rather than 3: both 12 and 3 occur twice, and the tuple comparison breaks the tie in favor of the larger value.)

According to @David Sanders, if you increase the array size to something like 100,000 elements, the "max w/set" algorithm ends up being the worst by far, whereas the "numpy bincount" method is the best.
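
A rough harness for reproducing that large-array comparison (my sketch; absolute timings will vary by machine):

import timeit
import numpy as np

a = (np.random.rand(100000) * 1000).round().astype('int')
a_list = list(a)

# numpy bincount: fast for bounded non-negative ints
print(timeit.timeit(lambda: np.bincount(a).argmax(), number=100))
# max w/set: an O(n) count per unique value, so roughly O(n * uniques) overall
print(timeit.timeit(lambda: max(set(a_list), key=a_list.count), number=1))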

iuridiniz
  • @IuliusCurt in order to pick the best approach we need to test it against multiple cases: small arrays, large arrays, random arrays, real-world arrays (like [timsort](https://en.wikipedia.org/wiki/Timsort) does for sorting), ... But I agree with you – iuridiniz Mar 27 '16 at 13:12
  • Using only a small array, as in your approach, isn't going to distinguish very well between the different algorithms. – David Sanders Dec 10 '16 at 21:53
  • If you increase the test list size to 100000 (`a = (np.random.rand(100000) * 1000).round().astype('int'); a_list = list(a)`), your "max w/set" algorithm ends up being the worst by far, whereas the "numpy bincount" method is the best. I conducted this test using `a_list` for native python code and `a` for numpy code to avoid marshalling costs screwing up the results. – David Sanders Dec 10 '16 at 22:16
9

Starting in Python 3.4, the standard library includes the statistics.mode function to return the single most common data point.

from statistics import mode

mode([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
# 1

If there are multiple modes with the same frequency, statistics.mode returns the first one encountered (since Python 3.8; before that it raised StatisticsError for multimodal data).


Starting in Python 3.8, the statistics.multimode function returns a list of the most frequently occurring values in the order they were first encountered:

from statistics import multimode

multimode([1, 2, 3, 1, 2])
# [1, 2]
Xavier Guihot
5

Also, if you want to get the most frequent value (positive or negative) without loading any modules, you can use the following code:

lVals = [1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1]
print(max(map(lambda val: (lVals.count(val), val), set(lVals))))
Artsiom Rudzenka
    This is from a while ago, but for posterity: this is equivalent to the easier-to-read `max(set(lVals), key=lVals.count)`, which does an O(n) count for each unique element of `lVals` for approximately O(n^2) (assuming O(n) unique elements). Using `collections.Counter(lVals).most_common(1)[0][0]` from the standard library, as [suggested by JoshAdel](http://stackoverflow.com/a/6252400/344821), is only O(n). – Danica Aug 14 '12 at 16:46
4

While most of the answers above are useful, you might: 1) need to support non-positive-integer values (e.g. floats or negative integers), 2) not be on Python 2.7 (which collections.Counter requires), and 3) prefer not to add a scipy (or even numpy) dependency to your code. In that case, a pure Python 2.6 solution that is O(n log n) (i.e., efficient) is just this:

from collections import defaultdict

a = [1,2,3,1,2,1,1,1,3,2,2,1]

d = defaultdict(int)
for i in a:
  d[i] += 1
most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
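
On Python 3, dict.iteritems is gone; a Python 3 adaptation of the same idea (my sketch, using max instead of sorted, which also drops the cost from O(n log n) to O(n)):

from collections import defaultdict

a = [1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1]

d = defaultdict(int)
for i in a:
    d[i] += 1
most_frequent = max(d.items(), key=lambda x: x[1])  # (value, count) pair
print(most_frequent)  # (1, 6)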
JJC
4

In Python 3 the following should work:

max(set(a), key=lambda x: a.count(x))
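
Note that this relies on list.count, so it works on a Python list but not directly on a NumPy array (ndarray has no .count method). A hedged sketch of the conversion:

import numpy as np

arr = np.array([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
a = arr.tolist()  # convert first: ndarray has no .count method
print(max(set(a), key=lambda x: a.count(x)))  # prints 1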
2

I like the solution by JoshAdel.

But there is just one catch.

The np.bincount() solution only works on numbers.

If you have strings, the collections.Counter solution will work for you.

Vikas
2

Here is a general solution that may be applied along an axis, regardless of values, using purely numpy. I've also found that this is much faster than scipy.stats.mode if there are a lot of unique values.

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray.ravel()[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except (IndexError, TypeError):
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If the array is 1-D and the numpy version is >= 1.9, numpy.unique will suffice
    if ndim == 1 and tuple(int(v) for v in numpy.__version__.split('.')[:2]) >= (1, 9):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts (modern NumPy requires a tuple index)
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = numpy.ogrid[slices]
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[tuple(index)], counts[tuple(index)]
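
A quick 1-D usage sketch (my example, exercising the numpy.unique fast path above):

a = numpy.array([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
print(mode(a))  # modal value 1 with count 6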
Devin Cairns
2

Expanding on this method, applied to finding the mode of the data, you may need the index into the actual array to see how far the value is from the center of the distribution.

(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]

Remember to discard the mode when several values tie for the maximum count (note that the check `len(np.argmax(counts)) > 1` doesn't work as written, since np.argmax returns a scalar).
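
A sketch of the tie check (mine, reusing `a` and the unique call from the snippet above):

(values, idx, counts) = np.unique(a, return_index=True, return_counts=True)
if np.count_nonzero(counts == counts.max()) > 1:
    print('multiple values tie for most frequent; argmax picks only the first')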

Lean Bravo
1

You can use the following approach:

x = np.array([[2, 5, 5, 2], [2, 7, 8, 5], [2, 5, 7, 9]])
u, c = np.unique(x, return_counts=True)
print(u[c == np.amax(c)])

This will give the answer: array([2, 5])

Cody Gray
0

Using np.bincount and np.argmax can get the most common value in a numpy array. If your array is an image array, use np.ravel() or arr.flatten() to convert the ndarray to a one-dimensional array first.
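
For instance, a small sketch with a hypothetical 8-bit image array:

import numpy as np

img = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)  # hypothetical image
flat = img.ravel()                 # view the 2-D image as a 1-D array
print(np.bincount(flat).argmax())  # most common pixel value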

-1

I was recently doing a project in which I used collections.Counter (which tortured me).

The Counter in collections has very, very bad performance in my opinion. It's just a class wrapping dict().

What's worse, if you use cProfile to profile its methods, you will see a lot of '__missing__' and '__instancecheck__' stuff wasting the whole time.

Be careful using its most_common(), because every call invokes a sort, which makes it extremely slow. And if you use most_common(x), it invokes a heap sort, which is also slow.

By the way, numpy's bincount also has a problem: if you use np.bincount([1,2,4000000]), you will get an array with 4000001 elements.
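
That last point is easy to demonstrate (a tiny sketch):

import numpy as np

counts = np.bincount([1, 2, 4000000])
print(counts.shape)  # (4000001,): one bin per integer from 0 up to the maximum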

Weichu Liu
  • A dict is the most finely-tuned data structure in Python and is ideal for counting arbitrary objects. In contrast, binning only works on numerical values and doesn't let you prevent aliasing between closely spaced discrete values. In the case of Counter, the __missing__ method is only called when an element is first seen; otherwise, its presence is cost-free. Note, the *most_common()* method is blazingly fast in most cases because the heap is very small compared to the total dataset. In most cases, the *most_common()* method makes only slightly more comparisons than *min()*. – Raymond Hettinger Mar 31 '13 at 21:46