388

How do I efficiently obtain the frequency count for each unique value in a NumPy array?

>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> freq_count(x)
[(1, 5), (2, 3), (5, 1), (25, 1)]
Mateen Ulhaq
Abe

17 Answers

756

Use numpy.unique with return_counts=True (for NumPy 1.9+):

import numpy as np

x = np.array([1,1,1,2,2,2,5,25,1,1])
unique, counts = np.unique(x, return_counts=True)

>>> print(np.asarray((unique, counts)).T)
 [[ 1  5]
  [ 2  3]
  [ 5  1]
  [25  1]]
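
If you'd rather have a mapping from each value to its count, a small follow-up step (my sketch, not part of the original answer):

>>> dict(zip(unique.tolist(), counts.tolist()))
{1: 5, 2: 3, 5: 1, 25: 1}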

In comparison with scipy.stats.itemfreq:

In [4]: x = np.random.random_integers(0,100,1e6)

In [5]: %timeit unique, counts = np.unique(x, return_counts=True)
10 loops, best of 3: 31.5 ms per loop

In [6]: %timeit scipy.stats.itemfreq(x)
10 loops, best of 3: 170 ms per loop
Mateen Ulhaq
jme
  • If you get the error: TypeError: unique() got an unexpected keyword argument 'return_counts', just do: unique, counts = np.unique(x, True) – NumesSanguis Dec 02 '14 at 14:10
  • @NumesSanguis What version of numpy are you using? Prior to v1.9, the `return_counts` keyword argument didn't exist, which might explain the exception. In that case, [the docs](http://docs.scipy.org/doc/numpy-1.8.0/reference/generated/numpy.unique.html#numpy.unique) suggest that `np.unique(x, True)` is equivalent to `np.unique(x, return_index=True)`, which doesn't return counts. – jme Dec 02 '14 at 15:05
  • My numpy version is 1.6.1 and that may also explain the strange numbers I get as output in the second row, because those are much higher than the actual count. – NumesSanguis Dec 03 '14 at 14:19
  • In older numpy versions the typical idiom to get the same thing was `unique, idx = np.unique(x, return_inverse=True); counts = np.bincount(idx)`. When this feature was added (see [here](https://github.com/numpy/numpy/pull/4180)) some informal testing had the use of `return_counts` clocking over 5x faster. – Jaime Jan 28 '15 at 01:56
  • If you are using **ActivePython**, then numpy is very probably out of date. Check current numpy version with `pip list` AND `pypm list`, then run `pypm uninstall numpy` and `pip install numpy`. – KrisWebDev Dec 21 '15 at 09:18
  • this is probably the most hidden feature. I have to google it every time after I try `np.count`, `np.value_counts`, `np.freq` etc – Ciprian Tomoiagă Apr 04 '19 at 08:10
  • Unfortunately, this is the extremely slow bottleneck for my application, like 100 times as slow as any other command. Hopefully I can find some faster way. – endolith Aug 18 '19 at 02:07
201

Take a look at np.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
y = np.bincount(x)
ii = np.nonzero(y)[0]

And then:

list(zip(ii, y[ii]))
# [(1, 5), (2, 3), (5, 1), (25, 1)]

or:

np.vstack((ii, y[ii])).T
# array([[ 1,  5],
#        [ 2,  3],
#        [ 5,  1],
#        [25,  1]])

or however you want to combine the counts and the unique values.
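
As the comments below note, bincount only accepts non-negative integers. If your values may be negative, one workaround (a sketch under that assumption, not part of the original answer) is to shift everything by the minimum first:

x = np.array([-2, -2, 0, 3, 3, 3])
offset = x.min()
y = np.bincount(x - offset)              # indices are the shifted values
ii = np.nonzero(y)[0]
list(zip((ii + offset).tolist(), y[ii].tolist()))
# [(-2, 2), (0, 1), (3, 3)]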

JoshAdel
  • Hi, This wouldn't work if elements of x have a dtype other than int. – Manoj Feb 24 '14 at 08:20
  • It won't work if they're anything other than non negative ints, and it will be very space inefficient if the ints are spaced out. – Erik Jun 09 '14 at 06:46
  • With numpy version 1.10 I found that, for counting integer, it is about 6 times faster than np.unique. Also, note that it does count negative ints too, if right parameters are given. – Jihun Jan 06 '16 at 15:43
  • @Manoj : My elements x are arrays. I am testing the solution of jme. – Catalina Chircu Feb 13 '20 at 11:04
  • What would be a good analog then for the `return_inverse` option here? – Yuval Jan 25 '22 at 16:14
165

Use this:

>>> import numpy as np
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> np.array(np.unique(x, return_counts=True)).T
    array([[ 1,  5],
           [ 2,  3],
           [ 5,  1],
           [25,  1]])

Original answer:

Use scipy.stats.itemfreq (warning: deprecated):

>>> from scipy.stats import itemfreq
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> itemfreq(x)
/usr/local/bin/python:1: DeprecationWarning: `itemfreq` is deprecated! `itemfreq` is deprecated and will be removed in a future version. Use instead `np.unique(..., return_counts=True)`
array([[  1.,   5.],
       [  2.,   3.],
       [  5.,   1.],
       [ 25.,   1.]])
Mateen Ulhaq
mckelvin
68

I was also interested in this, so I did a little performance comparison (using perfplot, a pet project of mine). Result:

y = np.bincount(a)
ii = np.nonzero(y)[0]
out = np.vstack((ii, y[ii])).T

is by far the fastest. (Note the log-scaling.)

[perfplot results: log-log plot of runtime vs. len(a); bincount is fastest at every array size]


Code to generate the plot:

import numpy as np
import pandas as pd
import perfplot
from scipy.stats import itemfreq


def bincount(a):
    y = np.bincount(a)
    ii = np.nonzero(y)[0]
    return np.vstack((ii, y[ii])).T


def unique(a):
    unique, counts = np.unique(a, return_counts=True)
    return np.asarray((unique, counts)).T


def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)
    np.add.at(count, inverse, 1)
    return np.vstack((unique, count)).T


def pandas_value_counts(a):
    out = pd.value_counts(pd.Series(a))
    out.sort_index(inplace=True)
    out = np.stack([out.keys().values, out.values]).T
    return out


b = perfplot.bench(
    setup=lambda n: np.random.randint(0, 1000, n),
    kernels=[bincount, unique, itemfreq, unique_count, pandas_value_counts],
    n_range=[2 ** k for k in range(26)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()
Nico Schlömer
  • Thanks for posting the code to generate the plot. Didn't know about [perfplot](https://pypi.python.org/pypi/perfplot) before now. Looks handy. – ruffsl Mar 07 '18 at 22:57
  • I was able to run your code by adding the option `equality_check=array_sorteq` in `perfplot.show()`. What was causing an error ( in Python 2) was `pd.value_counts` (even with sort=False). – user2314737 Jun 17 '18 at 17:38
51

Using pandas module:

>>> import pandas as pd
>>> import numpy as np
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> pd.value_counts(x)
1     5
2     3
25    1
5     1
dtype: int64
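
If you want the same (value, count) array layout as the numpy answers, one possible conversion (a sketch; value_counts sorts by frequency, so sort_index restores value order):

vc = pd.value_counts(x).sort_index()
np.stack([vc.index.to_numpy(), vc.to_numpy()]).T
# array([[ 1,  5],
#        [ 2,  3],
#        [ 5,  1],
#        [25,  1]])
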
ivankeller
  • pd.Series() is not necessary. Otherwise, good example. Numpy as well. Pandas can take a simple list as input. – Yohan Obadia Apr 23 '17 at 08:06
  • @YohanObadia - depending on the size of the array, first converting it to a series has made the final operation faster for me. I would guess at the mark of around 50,000 values. – n1k31t4 Oct 30 '18 at 11:50
  • I edited my answer to take into account the relevant comment from @YohanObadia – ivankeller Nov 09 '19 at 21:54
  • `df = pd.DataFrame(x); df = df.astype('category'); print(df.describe())` will give info like `count 10, unique 4, top 1, freq 5`, which can be useful – Subham Mar 15 '21 at 07:43
20

This is by far the most general and performant solution; I'm surprised it hasn't been posted yet.

import numpy as np

def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)  # np.int has been removed from numpy
    np.add.at(count, inverse, 1)
    return np.vstack((unique, count)).T

print(unique_count(np.random.randint(-10, 10, 100)))

Unlike the currently accepted answer, it works on any datatype that is sortable (not just positive ints), and it has optimal performance; the only significant expense is in the sorting done by np.unique.
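
For example (my addition, not the author's), the same function on an array of strings:

print(unique_count(np.array(['a', 'b', 'a', 'c', 'a'])))
# [['a' '3']
#  ['b' '1']
#  ['c' '1']]
# (vstack coerces the counts to strings here, since the stacked
# arrays must share a common dtype)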

Eelco Hoogendoorn
15

numpy.bincount is probably the best choice. If your array contains anything besides small, dense integers, it might be useful to wrap it in something like this:

def count_unique(keys):
    uniq_keys = np.unique(keys)
    bins = uniq_keys.searchsorted(keys)
    return uniq_keys, np.bincount(bins)

For example:

>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> count_unique(x)
(array([ 1,  2,  5, 25]), array([5, 3, 1, 1]))
Bi Rico
8

Even though this has already been answered, I suggest a different approach that makes use of numpy.histogram. Given a sequence, it returns the frequency of its elements grouped in bins.

Beware though: it works in this example only because the numbers are integers. If they were real numbers, this solution would not apply as nicely.

>>> from numpy import histogram
>>> y = histogram(x, bins=x.max()-1)
>>> y
(array([5, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1]),
 array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.]))
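
To make the bins line up exactly with the integer values and address the caveat above, you can pass explicit bin edges instead; a sketch:

>>> import numpy as np
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> counts, edges = np.histogram(x, bins=np.arange(x.min(), x.max() + 2))
>>> edges[:-1][counts > 0], counts[counts > 0]
(array([ 1,  2,  5, 25]), array([5, 3, 1, 1]))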
Jir
5

Old question, but I'd like to provide my own solution, which turned out to be the fastest in my benchmarks: use a plain list instead of an np.array as input (or convert it to a list first).

Check it out if you run into the same problem.

def count(a):
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results

For example,

%timeit count([1,1,1,2,2,2,5,25,1,1])
100000 loops, best of 3: 2.26 µs per loop

%timeit count(np.array([1,1,1,2,2,2,5,25,1,1]))
100000 loops, best of 3: 8.8 µs per loop

%timeit count(np.array([1,1,1,2,2,2,5,25,1,1]).tolist())
100000 loops, best of 3: 5.85 µs per loop

The accepted answer is slower, and the scipy.stats.itemfreq solution is even worse.


More in-depth testing did not confirm the formulated expectation.

from zmq import Stopwatch
aZmqSTOPWATCH = Stopwatch()

aDataSETasARRAY = ( 100 * abs( np.random.randn( 150000 ) ) ).astype( int )
aDataSETasLIST  = aDataSETasARRAY.tolist()

import numba
@numba.jit
def numba_bincount( anObject ):
    np.bincount(    anObject )
    return

aZmqSTOPWATCH.start();np.bincount(    aDataSETasARRAY );aZmqSTOPWATCH.stop()
14328L

aZmqSTOPWATCH.start();numba_bincount( aDataSETasARRAY );aZmqSTOPWATCH.stop()
592L

aZmqSTOPWATCH.start();count(          aDataSETasLIST  );aZmqSTOPWATCH.stop()
148609L

See the comments below on caching and other in-RAM side-effects that massively influence repetitive testing of small datasets.

user3666197
Rain Lee
  • This answer is really good, as it shows `numpy` is not necessarily the way to go. – Mahdi Apr 21 '15 at 08:21
  • @Rain Lee interesting. Have you cross-validated the list-hypothesis also on some non-cache-able dataset size? Lets assume a 150.000 random items in either representation and measured a bit more accurate on a single run as by an example of **aZmqStopwatch.start();count(aRepresentation);aZmqStopwatch.stop()** ? – user3666197 Aug 04 '15 at 19:17
  • Did some testing and yes, there are **huge differences** in real dataset performance. Testing requires a bit more insight into python internal mechanics than running just a brute-force scaled loops and quote non realistic *in-vitro* nanoseconds. As tested - a **np.bincount()** can be made to handle 150.000 array within **less than 600 [us]** while the above **def**-ed **count()** on a pre-converted list representation thereof took more than **122.000 [us]** – user3666197 Aug 04 '15 at 19:40
  • Yeah, my rule-of-thumb is **numpy** for anything that can handle small amounts of latency but has the potential to be very large, **lists** for smaller data sets where latency critical, and of course **real benchmarking** FTW :) – David Sep 08 '15 at 19:28
5
import pandas as pd
import numpy as np
x = np.array( [1,1,1,2,2,2,5,25,1,1] )
print(dict(pd.Series(x).value_counts()))

This gives you: {1: 5, 2: 3, 5: 1, 25: 1}
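
As the comment below points out, the standard library produces the same dictionary without pandas; a minimal alternative:

from collections import Counter
print(dict(Counter(x.tolist())))
# {1: 5, 2: 3, 5: 1, 25: 1}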

Kerem T
  • `collections.Counter(x)` also gives the same result. I believe the OP wants an output that resembles R's `table` function. Keeping the `Series` may be more useful. – pylang May 18 '17 at 08:12
  • Please note that it would be necessary to transfer to `pd.Series(x).reshape(-1)` if it is a multidimensional array. – natsuapo Dec 10 '19 at 09:48
4

To count unique non-integers - similar to Eelco Hoogendoorn's answer but considerably faster (a factor of 5 on my machine) - I used weave.inline to combine numpy.unique with a bit of C code:

import numpy as np
from scipy import weave

def count_unique(datain):
  """
  Similar to numpy.unique function for returning unique members of
  data, but also returns their counts
  """
  data = np.sort(datain)
  uniq = np.unique(data)
  nums = np.zeros(uniq.shape, dtype='int')

  code="""
  int i,count,j;
  j=0;
  count=0;
  for(i=1; i<Ndata[0]; i++){
      count++;
      if(data(i) > data(i-1)){
          nums(j) = count;
          count = 0;
          j++;
      }
  }
  // Handle last value
  nums(j) = count+1;
  """
  weave.inline(code,
      ['data', 'nums'],
      extra_compile_args=['-O2'],
      type_converters=weave.converters.blitz)
  return uniq, nums

Profile info

> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop

Eelco's pure numpy version:

> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop

Note

There's redundancy here (unique performs a sort also), meaning that the code could probably be further optimized by putting the unique functionality inside the c-code loop.
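
Since scipy.weave has since been removed from SciPy entirely, here is a rough modern translation of the same run-length idea using numba (my sketch, not the author's code, and not benchmarked against the numbers above):

import numba
import numpy as np

@numba.njit
def _run_lengths(data, nums):
    # data must be sorted; writes the length of each run into nums
    j = 0
    count = 1
    for i in range(1, data.shape[0]):
        if data[i] != data[i - 1]:
            nums[j] = count
            count = 0
            j += 1
        count += 1
    nums[j] = count  # the last run

def count_unique_numba(datain):
    data = np.sort(datain)
    uniq = np.unique(data)
    nums = np.zeros(uniq.shape, dtype=np.int64)
    _run_lengths(data, nums)
    return uniq, nums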

jmetz
3

Multi-dimensional frequency count, i.e. counting the unique rows of an array:

>>> print(color_array)
array([[255, 128, 128],
       [255, 128, 128],
       [255, 128, 128],
       ...,
       [255, 128, 128],
       [255, 128, 128],
       [255, 128, 128]], dtype=uint8)


>>> np.unique(color_array, return_counts=True, axis=0)
(array([[ 60, 151, 161],
        [ 60, 155, 162],
        [ 60, 159, 163],
        [ 61, 143, 162],
        [ 61, 147, 162],
        [ 61, 162, 163],
        [ 62, 166, 164],
        [ 63, 137, 162],
        [ 63, 169, 164],
        ...]),
 array([  1,   2,   2,   1,   4,   1,   1,   2,
          3,   1,   1,   1,   2,   5,   2,   2,
        898,   1,   1, ...]))
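
Since the output above is truncated, here is a self-contained miniature of the same call (my example):

import numpy as np

a = np.array([[255, 128, 128],
              [ 60, 151, 161],
              [255, 128, 128]], dtype=np.uint8)
rows, counts = np.unique(a, return_counts=True, axis=0)
# rows   -> [[ 60 151 161]
#            [255 128 128]]   (unique rows, sorted lexicographically)
# counts -> [1 2]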
vishal
2
import pandas as pd
import numpy as np

print(pd.Series(name_of_array).value_counts())
Andrew Regan
RAJAT BHATHEJA
1
from collections import Counter
import numpy as np

x = np.array([1,1,1,2,2,2,5,25,1,1])
counter = Counter(x)                  # maps each value to its frequency
mode = counter.most_common(1)[0][0]   # the most frequent value
Yichang Wu
1

Many simple problems get complicated because functionality like R's order(), which returns a statistical result in both ascending and descending order, is missing from various Python libraries. All such statistical ordering is easy to find in pandas, whose development has gone hand-in-hand with R's since the two were created for the same purpose, so we can get results sooner than by looking in a hundred different places. To solve this problem I use the following code, which gets me by anywhere:

import numpy as np
import pandas as pd

unique, counts = np.unique(x, return_counts=True)
d = {'unique': unique, 'counts': counts}  # pass the arrays to a dictionary
df = pd.DataFrame(d)  # a dictionary can be passed straight to a DataFrame
df.sort_values(by='counts', ascending=False, inplace=True)
df = df.reset_index(drop=True)  # optional, only if you want to use it further
0

Something like this should do it:

import numpy as np

# create 100 random numbers in [0, 50]
arr = np.random.randint(0, 51, 100)

# create a dictionary of the unique values
d = dict([(i, 0) for i in np.unique(arr)])
for number in arr:
    d[number] += 1   # increment when that value is found

Also, this previous post on Efficiently counting unique elements seems pretty similar to your question, unless I'm missing something.

benjaminmgross
  • The linked question is kinda similar, but it looks like he's working with more complicated data types. – Abe May 24 '12 at 16:44
0

You can write freq_count like this:

def freq_count(data):
    mp = {}
    for i in data:
        if i in mp:
            mp[i] += 1
        else:
            mp[i] = 1
    return mp
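
For example, applied to the data from the question (sorting the items reproduces the expected list of pairs):

>>> sorted(freq_count([1,1,1,2,2,2,5,25,1,1]).items())
[(1, 5), (2, 3), (5, 1), (25, 1)]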
Morteza Jalambadani