
I have an array like the following:

a = array([[1,1,1],
           [1,1,1],
           [1,1,1],
           [2,2,2],
           [2,2,2],
           [2,2,2],
           [3,3,0],
           [3,3,0],
           [3,3,0]])

I would like to count the number of occurrences of each distinct row, such as

[1,1,1]: 3, [2,2,2]: 3, and [3,3,0]: 3

How could I achieve this in Python? Is it possible without using a for loop and counting into a dictionary? It has to be fast, ideally under 0.1 seconds or so. I looked into Counter, numpy's bincount, etc., but those work on individual elements, not on whole rows.

Thanks.

user4279562
  • Possible duplicate of [Find unique rows in numpy.array](http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array) – Daniel Oct 20 '15 at 16:37

5 Answers


If you don't mind mapping the rows to tuples just to get the count, you can use a Counter dict, which runs in 28.5 µs on my machine under Python 3, well below your threshold:

In [5]: timeit Counter(map(tuple, a))
10000 loops, best of 3: 28.5 µs per loop

In [6]: c = Counter(map(tuple, a))

In [7]: c
Out[7]: Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
Padraic Cunningham

collections.Counter can do this conveniently, and almost like the example given.

>>> from collections import Counter
>>> c = Counter()
>>> for x in a:
...   c[tuple(x)] += 1
...
>>> c
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})

This converts each row to a tuple, which can be used as a dictionary key because tuples are immutable. Lists (and numpy arrays) are mutable, so they can't be used as dict keys.

Why do you want to avoid using for loops?

And similar to @padraic-cunningham's much cooler answer:

>>> Counter(tuple(x) for x in a)
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
>>> Counter(map(tuple, a))
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
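Once the Counter is built, retrieving the count for a particular row is a plain dict lookup with a tuple key (as mentioned in the comments below); a missing key returns 0 rather than raising:

```python
import numpy as np
from collections import Counter

a = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1],
              [2, 2, 2], [2, 2, 2], [2, 2, 2],
              [3, 3, 0], [3, 3, 0], [3, 3, 0]])

c = Counter(map(tuple, a))

# Tuples of numpy ints hash and compare equal to tuples of Python ints,
# so a plain-int key works; absent keys return 0 instead of raising.
print(c[(3, 3, 0)])  # 3
print(c[(9, 9, 9)])  # 0
```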
aneroid
  • Python loops with numpy should be avoided as they can be many times slower than a numpy solution. Therefore, if you are using numpy and python loops you are likely "doing it wrong". In both the python answers you are copying the entire array to a much less compact datatype which is especially worrying. – Daniel Oct 20 '15 at 13:22
  • @Ophion. do you have a faster numpy solution? – Padraic Cunningham Oct 20 '15 at 14:15
  • @PadraicCunningham Divakar's solution will be faster for reasonably sized problems. This problem has been asked, answered, and benchmarked many times in the past. I think the canonical answer is [here](http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array). – Daniel Oct 20 '15 at 16:32
  • @Ophion Re: "copying the entire array to a much less compact datatype". That's a good point. Though, I'd thought of taking it one further level of worrying by using a hash. But then also figured it to be unnecessarily complicated. One edit/addition I was going to put in the answer was retrieving the count of a certain elem, which is essentially a dict lookup in this case. So to print (for example) one of the counts, instead of `print c[(3, 3, 0)]`, one would put `print c.getcounts(3, 3, 0)` (or override `__getitem__` with the code for `getcounts`) and achieve a similar result. – aneroid Oct 20 '15 at 18:45
  • Also, one has to decide how much they care about having the original items. On a "reasonably sized" array/list of large enough objects with not many repetitions, having one actual copy plus one in the Counter makes it unreasonable. So may be better to lose the value altogether and use a hash or str on the objects, for the dict keys. I was writing a generic `HashedCounter` but realised that by the time all the required methods involving keys are overridden as above, one would end up writing [most of this](http://code.activestate.com/recipes/576611-counter-class/). But not Numpy-specific. – aneroid Oct 20 '15 at 19:33
  • Added benchmark for a decently large input array case in my solution. That benchmarking trend continued for medium sized inputs as well. I think @Ophion made sense when he talked about NumPy solutions being optimized for such cases that goes with its philosophy. – Divakar Oct 20 '15 at 20:43

You could convert each row into a single linear index by treating its elements as multi-dimensional coordinates with np.ravel_multi_index. Then np.unique gives us the position of the first occurrence of each unique row and, via its optional return_counts argument, the counts. Thus, the implementation would look something like this -

def unique_rows_counts(a):
    # Calculate linear indices using the rows of a
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)

    # Get the index of the first occurrence of each unique row and its count
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)

    # Return the unique rows from a and their respective counts
    return a[unq_idx], counts

Sample run -

In [64]: a
Out[64]: 
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [2, 2, 2],
       [2, 2, 2],
       [2, 2, 2],
       [3, 3, 0],
       [3, 3, 0],
       [3, 3, 0]])

In [65]: unqrows, counts = unique_rows_counts(a)

In [66]: unqrows
Out[66]: 
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 0]])
In [67]: counts
Out[67]: array([3, 3, 3])
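If you'd rather have the dictionary-style output the question asks for, the two arrays returned by unique_rows_counts can be zipped into one dict; this is a small addition on top of the answer's function (repeated here so the snippet stands alone):

```python
import numpy as np

def unique_rows_counts(a):
    # Encode each row as one linear index, then count the unique indices
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    return a[unq_idx], counts

a = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1],
              [2, 2, 2], [2, 2, 2], [2, 2, 2],
              [3, 3, 0], [3, 3, 0], [3, 3, 0]])

unqrows, counts = unique_rows_counts(a)

# Zip rows and counts into the {row: count} mapping from the question,
# converting numpy scalars to plain ints for a clean repr
count_map = {tuple(map(int, row)): int(n) for row, n in zip(unqrows, counts)}
print(count_map)  # {(1, 1, 1): 3, (2, 2, 2): 3, (3, 3, 0): 3}
```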

Benchmarking

Assuming you are okay with either numpy arrays or collections as outputs, one can benchmark the solutions provided thus far, like so -

Function definitions:

import numpy as np
from collections import Counter

def unique_rows_counts(a):
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    return a[unq_idx], counts

def map_Counter(a):
    return Counter(map(tuple, a))    

def forloop_Counter(a):      
    c = Counter()
    for x in a:
        c[tuple(x)] += 1
    return c

Timings:

In [53]: a = np.random.randint(0,4,(10000,5))

In [54]: %timeit map_Counter(a)
10 loops, best of 3: 31.7 ms per loop

In [55]: %timeit forloop_Counter(a)
10 loops, best of 3: 45.4 ms per loop

In [56]: %timeit unique_rows_counts(a)
1000 loops, best of 3: 1.72 ms per loop
Divakar
  • Amazingly all the answers including the question got downvotes, makes one wonder! – Divakar Oct 20 '15 at 13:59
  • @PadraicCunningham I could add runtime tests, but then OP might be looking to have a dict as output, so that won't be fair :) – Divakar Oct 20 '15 at 14:16
  • I will have to wait for the downvoter to release their superior solution so ;) – Padraic Cunningham Oct 20 '15 at 14:22
  • @PadraicCunningham /me is still waiting for this superior answer with bated breath ;-) – aneroid Oct 20 '15 at 19:19
  • @aneroid Added some benchmarks if OP is okay with NumPy arrays as output. Hope that helps a bit on the breathing part ;) – Divakar Oct 20 '15 at 20:47
  • +1 for numPythonic answer :-) And for the clarification. /aside: But I think we were waiting for whoever had downvoted all answers and the question :confused: – aneroid Oct 20 '15 at 21:49
  • @aneroid Yours isn't bad either ;) Someone must had been really bored! – Divakar Oct 20 '15 at 21:52
  • @Divakar Can you also time the `unique` example where its combined with `view`? I would be very curious to see the timings of `view` vs the multi index solution. – Daniel Oct 21 '15 at 13:25
  • @Ophion I am not sure if I follow, view on what exactly? – Divakar Oct 21 '15 at 17:08
  • @Divakar See [here](http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array), your `lidx` variable is not needed. – Daniel Oct 21 '15 at 19:05
  • What about if the array contains negative numbers? – Padraic Cunningham Dec 12 '15 at 16:33
  • @PadraicCunningham Then, I would think `a.max(0)` needs to be changed to `a.max(0) - a.min(0)`, though not tested. – Divakar Dec 12 '15 at 17:05
  • @Divakar, so changing to min should work for positive and negative numbers? – Padraic Cunningham Dec 12 '15 at 17:06
  • @PadraicCunningham I would think so, again as I said not tested, so not 100% sure, but the idea with `a.max(0)` was to give enough "separation" between consecutive tuples to be considered/simulate as indices of a multidimensional array. With the negative numbers, we just need to change that "separation" into more appropriate one by incorporating a.min(0). – Divakar Dec 12 '15 at 17:09
  • I keep getting `ValueError: invalid entry in coordinates array`; the only way I could get it to work was to use `mode='clip'` instead of the default `'raise'`, but I'm not sure what implications that may have – Padraic Cunningham Dec 12 '15 at 17:35
  • @PadraicCunningham Not tested for correctness, but something like - `base = a.min(0); lidx = np.ravel_multi_index(a.T- base.T[:,None],a.max(0)-base+1 )`. – Divakar Dec 12 '15 at 18:03
  • @PadraicCunningham If you don't mind me asking - What's with the sudden interest in this old post? :) – Divakar Dec 12 '15 at 18:05
  • @Divakar, a related question came up again and out of interest I was trying to compare creating a mapping of non unique elements to counts, it just would not work when there were negative numbers and it was beginning to drive me mad as to why – Padraic Cunningham Dec 12 '15 at 18:08
  • @PadraicCunningham Ah I see, well good to know it's still being used! – Divakar Dec 12 '15 at 18:09
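For reference, the negative-number fix sketched in the comments above (shift every column by its minimum before computing the linear indices) can be written out as a full function. The comment itself flagged it as untested, so treat this as a sketch of that suggestion rather than part of the original answer:

```python
import numpy as np

def unique_rows_counts_signed(a):
    # Shift each column so its minimum maps to 0; all entries then become
    # valid non-negative coordinates for np.ravel_multi_index
    base = a.min(0)
    lidx = np.ravel_multi_index((a - base).T, a.max(0) - base + 1)
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    return a[unq_idx], counts

a = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [ 2, 3, 4]])

unqrows, counts = unique_rows_counts_signed(a)
print(unqrows.tolist())  # [[-1, 0, 1], [2, 3, 4]]
print(counts.tolist())   # [2, 1]
```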

The numpy_indexed package (disclaimer: I am its author) contains efficient vectorized functionality for these kinds of operations:

import numpy_indexed as npi
unique_rows, row_count = npi.count(a, axis=0)

Note that this works for arrays of any dimensionality or datatype.

Eelco Hoogendoorn
  • Perfect answer, but how can we make it one combined output? For example: [[[1,1,9],10],[[1,1,0],2]] – Moustafa Mahmoud Mar 03 '18 at 10:18
  • `zip(*npi.count(..))` would give that, but it wouldn't be very numpythonic; alternatively, you could make a structured array with a composite dtype and assign the result to that, if you insist. But more likely than not you will end up with a more efficient solution if you stick to the way numpy likes to organise things natively. – Eelco Hoogendoorn Mar 03 '18 at 15:00

Since numpy 1.13.0, np.unique can be used with the axis argument:

>>> np.unique(a, axis=0, return_counts=True)

(array([[1, 1, 1],
        [2, 2, 2],
        [3, 3, 0]]), array([3, 3, 3]))
boyangeor