Most efficient way to find mode in numpy array

Question

I have a 2D array containing integers (both positive and negative). Each row represents the values over time for a particular spatial site, whereas each column represents values for various spatial sites for a given time.

So if the array is like:

1 3 4 2 2 7
5 2 2 1 4 1
3 3 2 2 1 1

The result should be

1 3 2 2 2 1

Note that when there are multiple values for mode, any one (selected randomly) may be set as mode.

I can iterate over the columns, finding the mode one at a time, but I was hoping numpy might have a built-in function for this, or that there is a trick to do it efficiently without looping.

There is http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mode.html and the answer here: http://stackoverflow.com/questions/6252280/find-the-most-frequent-number-in-a-numpy-vector — tom10, May 02 '13 at 05:35
@tom10: You mean [scipy.stats.mode()](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html#scipy.stats.mode), right? The other one seems to output a masked array. — fgb, May 02 '13 at 05:53

score 194 · Accepted Answer · edited Mar 11 '16 at 13:22

Check scipy.stats.mode() (inspired by @tom10's comment):

import numpy as np
from scipy import stats

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

m = stats.mode(a)
print(m)

Output:

ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))

As you can see, it returns both the mode as well as the counts. You can select the modes directly via m[0]:

print(m[0])

Output:

[[1 3 2 2 1 1]]

edited Mar 11 '16 at 13:22 by bugmenot123 · answered May 02 '13 at 05:50 by fgb

4

So numpy by itself does not support any such functionality? – Nik May 02 '13 at 06:51
1

Apparently not, but [scipy's implementation relies only on numpy](http://stackoverflow.com/questions/12399107/alternative-to-scipy-mode-function-in-numpy), so you could just copy that code into your own function. – fgb May 02 '13 at 06:53
17

Just a note, for people who look at this in the future: you need to `import scipy.stats` explicitly, it is not included when you simply do an `import scipy`. – ffledgling Aug 15 '13 at 12:42
1

Can you please explain how exactly it is displaying the mode values and count ? I couldn't relate the output with the input provided. – Rahul Dec 06 '17 at 11:54
@Osgux: what doesn't work? I just re-run the code in the answer against python 2.7.14, numpy 1.13.3, and scipy 1.0.0 without a problem. – fgb Dec 06 '17 at 17:27
@Rahul: I'm not sure I follow. The output is a ModeResult object with mode and count arrays displayed above. – fgb Dec 06 '17 at 17:29
@fgb sorry It was a problem on my laptop because I have 2 version of python =/ – marti_ Dec 06 '17 at 21:16
@Osgux: no worries; glad you could resolve the issue! – fgb Dec 06 '17 at 21:18
@fgb: we actually think mode as the frequency of a value in a given set of values? I am not getting how output is being displayed. Let's say for example, if it is 1D-array [1,2,2,4,5] then the mode will be 2. But in 2D-array as in above example its printing "mode=array([[1, 3, 2, 2, 1, 1]])". Why these many "1s or 2s" ? How it derives the output ? – Rahul Dec 07 '17 at 05:31
5

@Rahul: you have to consider the default second argument of `axis=0`. The above code is reporting the mode per column of the input. The count is telling us how many times it has seen the reported mode in each of the columns. If you wanted the overall mode, you need to specify `axis=None`. For further info, please refer to https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html – fgb Jan 10 '18 at 22:04
@fgb Thanks for the explanation. Understood it clearly now. – Rahul Jan 17 '18 at 07:00
Until https://github.com/scipy/scipy/pull/8294, `scipy.stats.mode` was very slow for some cases, to the point that the more general `find_repeats` can be faster: https://github.com/scipy/scipy/issues/3035 – Achal Dave Oct 27 '18 at 15:09
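To make the axis explanation in the comments above concrete, here is a minimal NumPy-only sketch of the difference between a per-column mode and an overall mode. The helper name mode_and_count is hypothetical (it is not part of NumPy or SciPy), and it always picks the smallest value on ties, which may differ from what other libraries return.

```python
import numpy as np

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

def mode_and_count(values):
    """Return the smallest most frequent value and how often it occurs."""
    # np.unique flattens its input and returns the sorted distinct values.
    vals, counts = np.unique(values, return_counts=True)
    i = np.argmax(counts)  # first maximum, so ties go to the smallest value
    return vals[i], counts[i]

# Per-column result: one (mode, count) pair for each of the six columns.
per_column = [mode_and_count(a[:, j]) for j in range(a.shape[1])]
column_modes = [int(m) for m, _ in per_column]
column_counts = [int(c) for _, c in per_column]

# Overall result: all 18 values are treated as a single sample.
overall_mode, overall_count = mode_and_count(a)

print(column_modes, column_counts)
print(int(overall_mode), int(overall_count))
```

The per-column lists correspond to the mode and count arrays discussed in the thread, while the overall pair is what a single mode over the flattened array looks like.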

Devin Cairns · Answer 2 · 2019-05-28T03:46:32.360

Update

The scipy.stats.mode function has been significantly optimized since this post and is now the recommended method.

Old answer

This is a tricky problem, since there is not much out there to calculate the mode along an axis. The solution is straightforward for 1-D arrays, where numpy.bincount is handy, along with numpy.unique with the return_counts argument set to True. The most common n-dimensional function I see is scipy.stats.mode, although it is prohibitively slow, especially for large arrays with many unique values. As a solution, I've developed this function, and use it heavily:

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except (IndexError, TypeError):
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If array is 1-D and numpy version is >= 1.9, numpy.unique will suffice.
    # Compare (major, minor) as a tuple so that e.g. 2.0 counts as newer than 1.9.
    version = tuple(int(part) for part in numpy.__version__.split('.')[:2])
    if ndim == 1 and version >= (1, 9):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
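    # Recent numpy versions require tuple indices, and ogrid may return a tuple
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[tuple(index)], counts[tuple(index)]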

Result:

In [2]: a = numpy.array([[1, 3, 4, 2, 2, 7],
                         [5, 2, 2, 1, 4, 1],
                         [3, 3, 2, 2, 1, 1]])

In [3]: mode(a)
Out[3]: (array([1, 3, 2, 2, 1, 1]), array([1, 2, 2, 2, 1, 2]))
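The function above is built on a sort, diff, and count-the-runs idea. As a rough illustration of that idea only, here is a simplified 1-D sketch. It is not the author's n-dimensional implementation, and the name mode_1d_sorted is made up for this example.

```python
import numpy as np

def mode_1d_sorted(values):
    """Return (mode, count) for a non-empty 1-D input using sort + run lengths."""
    s = np.sort(np.asarray(values).ravel())
    # Positions where a new run of equal values starts (index 0 always starts one).
    starts = np.flatnonzero(np.concatenate(([True], s[1:] != s[:-1])))
    # Run lengths are the gaps between consecutive starts; the last run ends at s.size.
    lengths = np.diff(np.concatenate((starts, [s.size])))
    best = np.argmax(lengths)  # first maximum, so ties resolve to the smallest value
    return s[starts[best]], lengths[best]

value, count = mode_1d_sorted([5, 2, 2, 1, 4, 1])
print(int(value), int(count))
```

Because equal values end up adjacent after sorting, the longest run is the mode, and no per-value bins are needed. That is why this kind of approach also copes with floats and with integers spread over a wide range.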

Some benchmarks:

In [4]: import scipy.stats

In [5]: a = numpy.random.randint(1,10,(1000,1000))

In [6]: %timeit scipy.stats.mode(a)
10 loops, best of 3: 41.6 ms per loop

In [7]: %timeit mode(a)
10 loops, best of 3: 46.7 ms per loop

In [8]: a = numpy.random.randint(1,500,(1000,1000))

In [9]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 1.01 s per loop

In [10]: %timeit mode(a)
10 loops, best of 3: 80 ms per loop

In [11]: a = numpy.random.random((200,200))

In [12]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 3.26 s per loop

In [13]: %timeit mode(a)
1000 loops, best of 3: 1.75 ms per loop

EDIT: Provided more of a background and modified the approach to be more memory-efficient

Please do contribute it to scipy's stat module so others also could benefit from it. — ARF, Feb 04 '19 at 09:07
For higher dimensional problems with big int ndarrays, your solution seems to be still much faster than scipy.stats.mode. I had to compute the mode along the first axis of a 4x250x250x500 ndarray, and your function took 10s, while scipy.stats.mode took almost 600s. — CheshireCat, Apr 29 '20 at 07:59
I concur with the comment above. Your function is still faster than scipy's implementation for larger matrices (though the performance I get from scipy is way better than 600s for me). — William Abma, Feb 23 '21 at 05:08
for those who want to avoid the debug cycle triggered by the over-OOP'd return type, `scipy.stats.mode(arr).mode[0]` is the answer. That is: the mode is found with the call `mode(arr).mode[0]`, but you might have to catch `ValueError`s for zero length arrays and wrap in your own `mode` function, so you'll have to alias the scipy mode or import stats and call it from there. — Chris, Dec 01 '21 at 22:09

score 36 · Answer 3 · answered Apr 17 '20 at 02:34

If you want to use numpy only:

import numpy as np

x = [-1, 2, 1, 3, 3]
vals,counts = np.unique(x, return_counts=True)

gives

(array([-1,  1,  2,  3]), array([1, 1, 1, 2]))

And extract it:

index = np.argmax(counts)
mode = vals[index]  # 3

answered Apr 17 '20 at 02:34 by poisonedivy

2

Like this method because it supports not only integers, but also float and even strings! – Christopher Jul 05 '20 at 11:31
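As the comment notes, the np.unique approach is not limited to integers. A small sketch, wrapping the answer's two steps in a hypothetical helper called unique_mode, shows it on floats and strings. On ties it returns the smallest element in sort order.

```python
import numpy as np

def unique_mode(values):
    """Return the most frequent element of a 1-D sequence (smallest one on ties)."""
    vals, counts = np.unique(values, return_counts=True)
    return vals[np.argmax(counts)]

float_mode = unique_mode([0.5, 1.5, 1.5, -2.0])
string_mode = unique_mode(["b", "a", "b", "c"])
print(float(float_mode), str(string_mode))
```

The results are NumPy scalar types, so convert with float() or str() if plain Python values are needed.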

score 18 · Answer 4 · answered May 08 '19 at 23:54

A neat solution that only uses numpy (not scipy nor the Counter class):

import numpy as np

A = np.array([[1,3,4,2,2,7], [5,2,2,1,4,1], [3,3,2,2,1,1]])

np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=A)

array([1, 3, 2, 2, 1, 1])

answered May 08 '19 at 23:54 by Def_Os

3

Nice and concise, but should be used with caution if the original arrays contain a very large number because bincount will create bin arrays with len( max(A[i]) ) for each original array A[i]. – scottlittle Jan 06 '20 at 22:15
1

This is an awesome solution. There is actually a drawback in `scipy.stats.mode`. When there are multiple values having the most occurrence (multiple modes), it will throw an exception. But this method will automatically take the "first mode". – Christopher Jul 05 '20 at 11:23
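One more caveat for the bincount approach: np.bincount rejects negative integers, and the question allows them. A possible workaround, sketched here with a hypothetical helper bincount_mode, is to shift each column so that its minimum maps to bin 0. The memory concern from the comment above still applies, since the bin array length then depends on the range (max minus min) of each column.

```python
import numpy as np

def bincount_mode(column):
    """Mode of a 1-D integer array; shifts values so the minimum maps to bin 0."""
    offset = column.min()
    # argmax returns the first maximum, so ties resolve to the smallest value.
    return np.bincount(column - offset).argmax() + offset

A = np.array([[-1,  3, 4],
              [ 5, -2, 4],
              [-1, -2, 2]])

modes = np.apply_along_axis(bincount_mode, axis=0, arr=A)
print(modes)
```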

score 13 · Answer 5 · edited May 23 '17 at 11:54

Expanding on this method, applied to finding the mode of the data where you may need the index of the actual array to see how far away the value is from the center of the distribution.

(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]

Remember to discard the mode when len(np.argmax(counts)) > 1, also to validate if it is actually representative of the central distribution of your data you may check whether it falls inside your standard deviation interval.

edited May 23 '17 at 11:54 by Community · answered May 06 '17 at 08:30 by Lean Bravo

When does np.argmax ever return something with length greater than 1 if you don't specify an axis? – loganjones16 Nov 22 '19 at 04:57
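As the comment points out, np.argmax without an axis returns a single index, so it cannot be used to detect ties. One way to check for multiple modes, shown here as a sketch with made-up example data, is to compare every count against the maximum count.

```python
import numpy as np

a = np.array([4, 1, 2, 2, 4, 3])

vals, counts = np.unique(a, return_counts=True)

# Boolean mask of every value whose count equals the highest count.
tied_modes = vals[counts == counts.max()]
has_unique_mode = tied_modes.size == 1

print(tied_modes, has_unique_mode)
```

If has_unique_mode is False, the caller can decide whether to discard the result, pick one of tied_modes, or report all of them.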

score 11 · Answer 6 · edited Dec 08 '22 at 08:20

The simplest way in Python to get the mode of a list or array a:

import statistics
a=[7,4,4,4,4,25,25,6,7,4867,5,6,56,52,32,44,4,4,44,4,44,4]
print(f"{statistics.mode(a)} is the mode (most frequently occurring number)")

That's it

edited Dec 08 '22 at 08:20 by Nav · answered Mar 25 '20 at 10:47 by Ashutosh K Singh

2

Thank you. why is this not the TOP answer? – John Henckel Apr 23 '22 at 02:20
Top solution, thanks @Ashutosh K Singh – Román Jul 25 '22 at 19:34
The person wants mode along an axis. statistics.mode(arr, axis=0) gives an error. – nadya Feb 03 '23 at 19:57
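statistics.mode has no axis argument, but for small inputs it can be applied column by column. The sketch below assumes Python 3.8 or later, where statistics.mode returns the first mode encountered when there are ties (earlier versions raise StatisticsError instead). It is a plain Python loop, so it is unlikely to be fast on large arrays.

```python
import statistics

rows = [[1, 3, 4, 2, 2, 7],
        [5, 2, 2, 1, 4, 1],
        [3, 3, 2, 2, 1, 1]]

# zip(*rows) transposes the nested list, yielding one tuple per column.
column_modes = [statistics.mode(col) for col in zip(*rows)]
print(column_modes)
```

For a NumPy array, a.T.tolist() gives the columns as Python lists that can be passed to statistics.mode in the same way.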

score 3 · Answer 7 · edited Apr 25 '18 at 01:51

I think a very simple way would be to use the Counter class. You can then use the most_common() function of the Counter instance as mentioned here.

For 1-d arrays:

import numpy as np
from collections import Counter

nparr = np.arange(10) 
nparr[2] = 6 
nparr[3] = 6 #6 is now the mode
mode = Counter(nparr).most_common(1)
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])

For multiple dimensional arrays (little difference):

import numpy as np
from collections import Counter

nparr = np.arange(10) 
nparr[2] = 6 
nparr[3] = 6 
nparr = nparr.reshape((2,5))     #same thing but we add this to reshape into ndarray (10 elements fit a 2x5 shape)
mode = Counter(nparr.flatten()).most_common(1)  # just use .flatten() method

# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])

This may or may not be an efficient implementation, but it is convenient.

Zeliha Bektas · Answer 8 · 2019-08-23T11:03:16.387

2

from collections import Counter

n = int(input())
data = sorted([int(i) for i in input().split()])

mode = sorted(sorted(Counter(data).items()), key = lambda x: x[1], reverse = True)[0][0]

print(mode)

Counter(data) counts the frequency of each value and returns a Counter, which is a dict subclass. sorted(Counter(data).items()) sorts by the keys, not the frequency. Finally, we sort by frequency using another sorted with key = lambda x: x[1]. reverse = True tells Python to sort the frequencies from largest to smallest. Because Python's sort is stable, values with equal frequency keep their key order, so ties resolve to the smallest value.
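The same "highest frequency first, smallest value on ties" rule can also be written without sorting at all, by taking a minimum over a composite key. This is only an alternative sketch, and the function name smallest_mode is hypothetical.

```python
from collections import Counter

def smallest_mode(data):
    """Return the most frequent value; on ties, return the smallest such value."""
    counts = Counter(data)
    # Negating the count makes the highest frequency the minimum of the key;
    # the value itself is the tie-breaker.
    value, _ = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return value

print(smallest_mode([3, 1, 3, 1, 2]))
```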

edited Aug 23 '19 at 11:03 · answered Aug 23 '19 at 10:06 by Zeliha Bektas

Since the question was asked 6 years ago, it is normal that he did not receive much reputation. – Zeliha Bektas Aug 23 '19 at 11:07

score 1 · Answer 9 · answered Feb 22 '21 at 13:32

If you want the mode as a plain int value, here is the easiest way. I was trying to find the mode of an array using scipy.stats, but the output of the code looks like this:

ModeResult(mode=array(2), count=array([[1, 2, 2, 2, 1, 2]]))

I only wanted the integer output, so if you want the same, just try this:

import numpy as np
from scipy import stats
numbers = list(map(int, input().split())) 
print(int(stats.mode(numbers)[0]))

Last line is enough to print Mode Value in Python: print(int(stats.mode(numbers)[0]))

score 1 · Answer 10 · edited Apr 09 '23 at 16:29

If you wish to use only numpy and do it without using the index of the array, the following implementation combining dictionaries with numpy can be used.

import numpy as np

x = np.array([1, 1, 2, 3])
val, count = np.unique(x,return_counts=True)
freq = {}
for v, c in zip(val, count):
  freq[v] = c
mode = sorted(freq.items(),key =lambda kv :kv[1])[-1] # (1, 2)
print(mode[0]) # prints 1 (most frequent item, mode)

edited Apr 09 '23 at 16:29 by Prashant Ghimire · answered Sep 21 '21 at 12:00 by Dinesh Marimuthu

score 0 · Answer 11 · answered Dec 19 '22 at 06:34

Finding Mode using dict in python

def mode(x):
  d={}
  k=0
  v=0
  for i in x:
    d[i]=d.get(i,0)+1
    if d[i]>v:
      k=i
      v=d[i]
  print(d)
  return k

x = [1, 1, 2, 3]  # example input (not part of the original answer)
print(mode(x))

answered Dec 19 '22 at 06:34 by Jayanth

normanius · Answer 12 · 2023-06-21T12:05:52.447

NumPy does not provide a dedicated method for calculating the mode of some data. One reason for this could be that the mode is often used for non-numeric, categorical variables, while NumPy is focused on numeric calculations.

Here is an alternative using pandas.DataFrame.mode(). It supports mixed-type data, see further below for an example.

import pandas as pd
data = [[1, 3, 4, 2, 2, 7],
        [5, 2, 2, 1, 4, 1],
        [3, 3, 2, 2, 1, 1]]
df = pd.DataFrame(data)
df.mode()

#    0    1    2    3  4    5
# 0  1  3.0  2.0  2.0  1  1.0
# 1  3  NaN  NaN  NaN  2  NaN
# 2  5  NaN  NaN  NaN  4  NaN

Here, we are interested only in the first row. To fetch it, use one of the following:

modes = df.mode().values[0]    #  array([1., 3., 2., 2., 1., 1.])
modes = df.mode().iloc[0]      #  pd.Series(...)

Details:

By default, pandas computes the column-wise modes. One can compute the row-wise modes by passing the argument axis=1: df.mode(axis=1)
Starting with SciPy 1.9, the support for non-numeric data has been deprecated and will not be possible for SciPy >=1.11. See docs of scipy.stats.mode(). SciPy recommends using the Pandas approach.
Pandas sorts the modes if there are multiple ones. If we just use the first row of the resulting DataFrame, we slightly deviate from the OP's question, which asked for one to be picked at random. Of course, we can fix this, see below.
The function mode() yields all possible modes if there are more than one, and stores them in a DataFrame. Unfortunately, this results in NaN values for the columns with fewer modes than the column with the maximal number of modes. In order to accommodate the NaNs, Pandas converts the dtype of the columns from int to float, which I consider a bit ugly. To recover from this, we need to force the original dtype. The code below shows how to do this.

Fix 1: Recover from typecast int → float:

# Works for both np.ndarray, pd.Series
modes.astype(int)

# For a mixed-type DataFrame, one could do the following:
# (Works only for column-wise modes)
[dtype.type(m) for m, dtype in zip(modes, df.dtypes)]

Fix 2: Pick a mode at random if there are multiple

modes = df.mode().apply(lambda x: np.random.choice(x.dropna()))

Example: Mixed-type data

import numpy as np
import pandas as pd

data = {"col1": ["foo", "bar", "baz", "foo", "bar", "foo", "bar", "baz"],
        "col2": [10, 0, 0, 10, 10, 10, 0, 10],
        "col3": [42., 14., 0.1, 1., 1., 4., 42., 14.],
        "col4": [False, False, False, True, True, True, False, True],
        "col5": [None, "abc", "abc", None, "def", "def", None, "abc"],
        "col6": [1.2, None, 1.2, 2.3, None, 2.3, 1.2, 2.3] }

df = pd.DataFrame(data)
#          col1     col2     col3   col4    col5     col6
#     0     foo       10     42.0  False    None      1.2
#     1     bar        0     14.0  False     abc      NaN
#     2     baz        0      0.1  False     abc      1.2
#     3     foo       10      1.0   True    None      2.3
#     4     bar       10      1.0   True     def      NaN
#     5     foo       10      4.0   True     def      2.3
#     6     bar        0     42.0  False    None      1.2
#     7     baz       10     14.0   True     abc      2.3
#  
# dtype  object    int64  float64   bool  object  float64

modes = df.mode()
#          col1     col2     col3    col4    col5     col6
#     0     bar     10.0      1.0   False     abc      1.2
#     1     foo      NaN     14.0    True     NaN      2.3
#     2     NaN      NaN     42.0     NaN     NaN      NaN
#
# dtype  object  float64  float64  object  object  float64

Note how the Nones are handled in the data, how multiple modes are sorted, and that the dtypes for col2 and col4 have changed.

Finally, we can fix the typecast and pick the mode at random if there are multiple:

modes_fixed = modes.apply(lambda x: np.random.choice(x.dropna()))
modes_fixed = [dtype.type(m) for m, dtype in zip(modes_fixed, df.dtypes)]
# ['foo', 10, 14.0, False, 'abc', 2.3]
