
I am trying to count the number of times each row appears in a np.array, for example:

import numpy as np
my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0], 
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [1, 1, 1, 1, 1, 0]])

Row [1, 2, 0, 1, 1, 1] shows up 3 times.

A simple, naive solution would be to convert all my rows to tuples and apply collections.Counter, like this:

from collections import Counter
def row_counter(my_array):
    list_of_tups = [tuple(ele) for ele in my_array]
    return Counter(list_of_tups)

Which yields:

In [2]: row_counter(my_array)
Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3, (1, 1, 1, 1, 1, 0): 1, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1})

However, I am concerned about the efficiency of my approach. And maybe there is a library that provides a built-in way of doing this. I tagged the question as pandas because I think that pandas might have the tool I am looking for.

Akavall
  • I like this problem! You may be able to use `np.lexsort` to your advantage, but I am not sure whether the collection after sorting can be done fast enough. – eickenberg Nov 18 '14 at 20:49

6 Answers


You can use the answer to this other question of yours to get the counts of the unique items.

In numpy 1.9 there is a return_counts optional keyword argument, so you can simply do the following. The trick is to view each row as a single np.void item, so that np.unique can compare whole rows as if they were scalars:

>>> my_array
array([[1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1],
       [1, 1, 1, 0, 0, 0],
       [1, 2, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 0]])
>>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
>>> b = np.ascontiguousarray(my_array).view(dt)
>>> unq, cnt = np.unique(b, return_counts=True)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
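
If you want the same shape of result as the Counter approach in the question, the unique rows and counts zip straight into a dict (a minimal sketch on top of the answer's unq and cnt; the tolist() conversion is my addition, to get plain Python ints as keys):

>>> dict(zip(map(tuple, unq.tolist()), cnt.tolist()))
{(1, 1, 1, 0, 0, 0): 1, (1, 1, 1, 1, 1, 0): 1, (1, 2, 0, 1, 1, 1): 3, (9, 7, 5, 3, 2, 1): 1}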

In earlier versions, you can do it as:

>>> unq, inv = np.unique(b, return_inverse=True)
>>> cnt = np.bincount(inv)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
Jaime
  • The last reshape can be simplified a bit with: `unq.view((my_array.dtype, my_array.shape[1]))`; it uses the same sort of multi-item dtype as the first `view`. – hpaulj Jul 31 '16 at 19:45
  • Does this have a benefit over `np.unique` with `axis` parameter? (Which may have been added after this question was written) – endolith Aug 07 '19 at 14:10
  • I keep getting a TypeError: "The axis argument to unique is not supported for dtype object". How can I fix this? – Charlie Vagg Nov 03 '20 at 11:57

I think just specifying axis in np.unique gives what you need.

import numpy as np
unq, cnt = np.unique(my_array, axis=0, return_counts=True)

Note: this feature is available only in numpy>=1.13.0.
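
With the array from the question this returns the same unique rows and counts as in the answer above (a quick check, reusing my_array):

unq, cnt = np.unique(my_array, axis=0, return_counts=True)
print(unq)  # the four unique rows, sorted lexicographically
print(cnt)  # [1 1 3 1]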

Yuya Takashina

(This assumes that the array is fairly small, e.g. fewer than 1000 rows: the broadcast comparison below builds an intermediate boolean array of shape (n_rows, n_rows, n_cols), so memory use grows quadratically with the number of rows.)

Here's a short NumPy way to count how many times each row appears in an array:

>>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
array([3, 3, 1, 1, 3, 1])

This counts how many times each row appears in my_array, returning an array where the first value shows how many times the first row appears, the second value shows how many times the second row appears, and so on.
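
As a small follow-up (my sketch, not part of the original answer), the per-row counts make it easy to pick out every row that occurs more than once:

>>> counts = (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
>>> my_array[counts > 1]
array([[1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1]])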

Alex Riley
  • With `n=np.arange(my_array.shape[0])` one can obtain a nice result also by writing `[n[ui] for ui in (my_array[:,np.newaxis,:] == my_array).all(axis=2)]`... Nice answer, I have already half understood it, but what puzzles me is how you came up with the solution! – gboffi Nov 18 '14 at 22:49

A pandas approach might look like this:

import pandas as pd

df = pd.DataFrame(my_array, columns=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
df.groupby(['c1','c2','c3','c4','c5','c6']).size()

Note: supplying column names is not necessary
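
As the comments below note, the column names can be omitted entirely; a minimal sketch of the no-names version (grouping by list(df.columns) is my generalization of the literal [0, 1, 2, 3, 4, 5] given in the comments):

import pandas as pd

df = pd.DataFrame(my_array)                   # columns default to 0..5
counts = df.groupby(list(df.columns)).size()  # one row per unique row, with its count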

Bob Haffner
  • i have no idea why this got downvoted. This is a good example of how to do this using Pandas. – JD Long Nov 18 '14 at 17:57
  • Can you show how you would do it without supplying columns names? – Akavall Nov 18 '14 at 18:07
  • Just omit the columns arg in the DataFrame() and use [0,1,2,3,4,5] in the groupby(). [0,1,2,3,4,5] will be the default column names that pandas assigns. – Bob Haffner Nov 18 '14 at 18:24
  • Got it! Thanks, I was trying to pass `np.arange(6)`, and that was not giving me what I wanted, but passing a list works. Thanks. – Akavall Nov 18 '14 at 18:32

Your solution is not bad, but if your matrix is large you will probably want a more efficient hash for the rows (compared to hashing whole tuples, as the Counter approach does) before counting. You can do that with joblib:

import numpy as np
import pandas as pd
import joblib
from collections import Counter

A = np.random.rand(5, 10000)

%timeit (A[:,np.newaxis,:] == A).all(axis=2).sum(axis=1)
10000 loops, best of 3: 132 µs per loop

%timeit Counter(joblib.hash(row) for row in A).values()
1000 loops, best of 3: 1.37 ms per loop

%timeit Counter(tuple(ele) for ele in A).values()
100 loops, best of 3: 3.75 ms per loop

%timeit pd.DataFrame(A).groupby(list(range(A.shape[1]))).size()
1 loops, best of 3: 2.24 s per loop

The pandas solution is extremely slow (about 2 s per loop) with this many columns. For a small matrix like the one you showed, your method is faster than joblib hashing but slower than numpy:

numpy:  100000 loops, best of 3: 15.1 µs per loop
joblib: 1000 loops, best of 3: 885 µs per loop
tuple:  10000 loops, best of 3: 27 µs per loop
pandas: 100 loops, best of 3: 2.2 ms per loop

If you have a large number of rows then you can probably find a better substitute for Counter to find hash frequencies.
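
For example, one possible substitute (a sketch I have not benchmarked) lets np.unique count the hash strings instead of Counter:

hashes = np.array([joblib.hash(row) for row in A])           # one hash string per row
unq_hashes, freq = np.unique(hashes, return_counts=True)     # frequency of each hash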

Edit: added numpy benchmarks from @acjr's solution on my system so that it is easier to compare. The numpy solution is the fastest one in both cases.

elyase

A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.count(my_array)
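
A hedged usage sketch (my reading of the package's API; the (unique rows, counts) tuple return is an assumption, check the numpy_indexed docs):

import numpy_indexed as npi

unq, cnt = npi.count(my_array)  # assumed to return (unique rows, counts)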
Eelco Hoogendoorn