
I am trying to count the number of times each row appears in a np.array, for example:

import numpy as np
my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0], 
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [1, 1, 1, 1, 1, 0]])

Row [1, 2, 0, 1, 1, 1] shows up 3 times.

A simple, naive solution would be to convert all my rows to tuples and apply collections.Counter, like this:

from collections import Counter
def row_counter(my_array):
    list_of_tups = [tuple(ele) for ele in my_array]
    return Counter(list_of_tups)

Which yields:

In [2]: row_counter(my_array)
Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3, (1, 1, 1, 1, 1, 0): 1, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1})

However, I am concerned about the efficiency of my approach. And maybe there is a library that provides a built-in way of doing this. I tagged the question as pandas because I think that pandas might have the tool I am looking for.

Akavall
  • I like this problem! You may be able to use `np.lexsort` to your advantage, but I am not sure whether the collection after sorting can be done fast enough. – eickenberg Nov 18 '14 at 20:49

6 Answers


You can use the answer to this other question of yours to get the counts of the unique items.

In numpy 1.9 there is a return_counts optional keyword argument, so you can simply do the following. The trick is to view each row as a single np.void item, so that np.unique can compare whole rows as if they were scalars:

>>> my_array
array([[1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1],
       [1, 1, 1, 0, 0, 0],
       [1, 2, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 0]])
>>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
>>> b = np.ascontiguousarray(my_array).view(dt)
>>> unq, cnt = np.unique(b, return_counts=True)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
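
If you want the same shape of result as the Counter approach in the question, the unique rows and counts zip straight into a dict (a minimal sketch on top of the answer's unq and cnt; the tolist() conversion is my addition, to get plain Python ints as keys):

>>> dict(zip(map(tuple, unq.tolist()), cnt.tolist()))
{(1, 1, 1, 0, 0, 0): 1, (1, 1, 1, 1, 1, 0): 1, (1, 2, 0, 1, 1, 1): 3, (9, 7, 5, 3, 2, 1): 1}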

In earlier versions, you can do it as:

>>> unq, inv = np.unique(b, return_inverse=True)
>>> cnt = np.bincount(inv)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
Jaime
  • The last reshape can be simplified a bit with: `unq.view((my_array.dtype, my_array.shape[1]))`; it uses the same sort of multi-item dtype as the first `view`. – hpaulj Jul 31 '16 at 19:45
  • Does this have a benefit over `np.unique` with `axis` parameter? (Which may have been added after this question was written) – endolith Aug 07 '19 at 14:10
  • I keep getting a TypeError: "The axis argument to unique is not supported for dtype object". How can I fix this? – Charlie Vagg Nov 03 '20 at 11:57

I think just specifying axis in np.unique gives what you need.

import numpy as np
unq, cnt = np.unique(my_array, axis=0, return_counts=True)

Note: this feature is available only in numpy>=1.13.0.
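
With the array from the question this returns the same unique rows and counts as in the answer above (a quick check, reusing my_array):

unq, cnt = np.unique(my_array, axis=0, return_counts=True)
print(unq)  # the four unique rows, sorted lexicographically
print(cnt)  # [1 1 3 1]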

Yuya Takashina

(This assumes that the array is fairly small, e.g. fewer than 1000 rows: the broadcast comparison below builds an intermediate boolean array of shape (n_rows, n_rows, n_cols), so memory use grows quadratically with the number of rows.)

Here's a short NumPy way to count how many times each row appears in an array:

>>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
array([3, 3, 1, 1, 3, 1])

This counts how many times each row appears in my_array, returning an array where the first value shows how many times the first row appears, the second value shows how many times the second row appears, and so on.
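
As a small follow-up (my sketch, not part of the original answer), the per-row counts make it easy to pick out every row that occurs more than once:

>>> counts = (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
>>> my_array[counts > 1]
array([[1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1]])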

Alex Riley
  • With `n=np.arange(my_array.shape[0])` one can obtain a nice result also by writing `[n[ui] for ui in (my_array[:,np.newaxis,:] == my_array).all(axis=2)]`... Nice answer, I have already half understood it, but what puzzles me is how you came up with the solution! – gboffi Nov 18 '14 at 22:49

A pandas approach might look like this:

import pandas as pd

df = pd.DataFrame(my_array, columns=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
df.groupby(['c1','c2','c3','c4','c5','c6']).size()

Note: supplying column names is not necessary
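
As the comments below note, the column names can be omitted entirely; a minimal sketch of the no-names version (grouping by list(df.columns) is my generalization of the literal [0, 1, 2, 3, 4, 5] given in the comments):

import pandas as pd

df = pd.DataFrame(my_array)                   # columns default to 0..5
counts = df.groupby(list(df.columns)).size()  # one row per unique row, with its count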

Bob Haffner
  • i have no idea why this got downvoted. This is a good example of how to do this using Pandas. – JD Long Nov 18 '14 at 17:57
  • Can you show how you would do it without supplying columns names? – Akavall Nov 18 '14 at 18:07
  • Just omit the columns arg in the DataFrame() and use [0,1,2,3,4,5] in the groupby(). [0,1,2,3,4,5] will be the default column names that pandas assigns. – Bob Haffner Nov 18 '14 at 18:24
  • Got it! Thanks, I was trying to pass `np.arange(6)`, and that was not giving me what I wanted, but passing a list works. Thanks. – Akavall Nov 18 '14 at 18:32

Your solution is not bad, but if your matrix is large you will probably want a more efficient hash for the rows (compared to hashing whole tuples, as the Counter approach does) before counting. You can do that with joblib:

import numpy as np
import pandas as pd
import joblib
from collections import Counter

A = np.random.rand(5, 10000)

%timeit (A[:,np.newaxis,:] == A).all(axis=2).sum(axis=1)
10000 loops, best of 3: 132 µs per loop

%timeit Counter(joblib.hash(row) for row in A).values()
1000 loops, best of 3: 1.37 ms per loop

%timeit Counter(tuple(ele) for ele in A).values()
100 loops, best of 3: 3.75 ms per loop

%timeit pd.DataFrame(A).groupby(list(range(A.shape[1]))).size()
1 loops, best of 3: 2.24 s per loop

The pandas solution is extremely slow (about 2 s per loop) with this many columns. For a small matrix like the one you showed, your method is faster than joblib hashing but slower than numpy:

numpy:  100000 loops, best of 3: 15.1 µs per loop
joblib: 1000 loops, best of 3: 885 µs per loop
tuple:  10000 loops, best of 3: 27 µs per loop
pandas: 100 loops, best of 3: 2.2 ms per loop

If you have a large number of rows then you can probably find a better substitute for Counter to find hash frequencies.
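
For example, one possible substitute (a sketch I have not benchmarked) lets np.unique count the hash strings instead of Counter:

hashes = np.array([joblib.hash(row) for row in A])           # one hash string per row
unq_hashes, freq = np.unique(hashes, return_counts=True)     # frequency of each hash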

Edit: added numpy benchmarks from @acjr's solution on my system so that it is easier to compare. The numpy solution is the fastest one in both cases.

elyase

A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.count(my_array)
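
A hedged usage sketch (my reading of the package's API; the (unique rows, counts) tuple return is an assumption, check the numpy_indexed docs):

import numpy_indexed as npi

unq, cnt = npi.count(my_array)  # assumed to return (unique rows, counts)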
Eelco Hoogendoorn