Efficiently counting number of unique elements - NumPy / Python

Question

When running np.unique(), it first flattens the array, sorts the array, then finds the unique values. When I have arrays with shape (10, 3000, 3000), it takes about a second to find the uniques, but this quickly adds up as I need to call np.unique() multiple times. Since I only care about the total number of unique numbers in an array, sorting seems like a waste of time.

Is there a faster method of find the total number of unique values in a large array other than np.unique()?

What is the data type of your array (e.g. what is `a.dtype`)? — Warren Weckesser, Oct 04 '17 at 23:01
pandas' unique function doesn't sort so it is faster. You might want to check that: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.unique.html — ayhan, Oct 04 '17 at 23:07
@twopint32oz Could you time both the posted solutions? It seems there's some discrepancies on the timings. Could you also report CPU model and RAM alongwith? Thanks! — Divakar, Oct 05 '17 at 06:33

score 14 · Accepted Answer · answered Oct 04 '17 at 23:18

14

Here's a method that works for an array with dtype np.uint8 that is faster than np.unique.

First, create an array to work with:

In [128]: a = np.random.randint(1, 128, size=(10, 3000, 3000)).astype(np.uint8)

For later comparison, find the unique values using np.unique:

In [129]: u = np.unique(a)

Here's the faster method; v will contain the result:

In [130]: q = np.zeros(256, dtype=int)

In [131]: q[a.ravel()] = 1

In [132]: v = np.nonzero(q)[0]

Verify that we got the same result:

In [133]: np.array_equal(u, v)
Out[133]: True

Timing:

In [134]: %timeit u = np.unique(a)
2.86 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [135]: %timeit q = np.zeros(256, dtype=int); q[a.ravel()] = 1; v = np.nonzero(q)
300 ms ± 5.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So 2.86 seconds for np.unique(), and 0.3 seconds for the alternative method.

answered Oct 04 '17 at 23:18

Warren Weckesser

110,654
19
194
214

This worked perfectly. Thanks. It seems like this can also be used to return a sorted unique array as well. I wonder why numpy decided to implement their unique() the way they did. – onepint16oz Oct 05 '17 at 02:34
`numpy.unique` can handle arrays of arbitrary data type (all the various integer and floating point types, complex types, and structured arrays). It does seem like it would be feasible for `numpy.unique` to handle the one (and possibly two) byte integer types with this method as a special case. – Warren Weckesser Oct 05 '17 at 06:23
There's nothing stopping you from using this technique for higher bit depths, except that the tallying array `q` quickly becomes enormous, right? It works for `uint16`, too, taking 0.6 s instead of 5.6 s. – endolith Aug 07 '19 at 13:58

Divakar · Answer 2 · 2017-10-05T04:40:54.627

We could leverage the fact that the elements are restricted to uint8 range by binned-counting with np.bincount and then simply count the number of non-zeros in it. Since, np.bincount expects a 1D array, we would flatten the input with np.ravel() and then feed it to bincount.

Hence, the implementation would be -

(np.bincount(a.ravel())!=0).sum()

Runtime test

Helper function to create input array with various number of unique numbers -

def create_input(n_unique):
    unq_nums = np.random.choice(np.arange(256), n_unique,replace=0)
    return np.random.choice(unq_nums, (10,3000,3000)).astype(np.uint8)

Other approach(es) :

# @Warren Weckesser's soln
def assign_method(a):
    q = np.zeros(256, dtype=int)
    q[a.ravel()] = 1
    return len(np.nonzero(q)[0])

Verification of proposed method -

In [141]: a = create_input(n_unique=120)

In [142]: len(np.unique(a))
Out[142]: 120

In [143]: (np.bincount(a.ravel())!=0).sum()
Out[143]: 120

Timings -

In [124]: a = create_input(n_unique=128)

In [125]: %timeit len(np.unique(a)) # Original soln
     ...: %timeit assign_method(a)  # @Warren Weckesser's soln
     ...: %timeit (np.bincount(a.ravel())!=0).sum()
     ...: 
1 loop, best of 3: 3.09 s per loop
1 loop, best of 3: 394 ms per loop
1 loop, best of 3: 209 ms per loop

In [126]: a = create_input(n_unique=256)

In [127]: %timeit len(np.unique(a)) # Original soln
     ...: %timeit assign_method(a)  # @Warren Weckesser's soln
     ...: %timeit (np.bincount(a.ravel())!=0).sum()
     ...: 
1 loop, best of 3: 3.46 s per loop
1 loop, best of 3: 378 ms per loop
1 loop, best of 3: 212 ms per loop

Nice. I had also experimented with `bincount`, but it appeared to be slower. In fact, when I use your code, I get 295 ms for `assign_method(a)` and 425 ms for `(np.bincount(a.ravel())!=0).sum()`. Go figure. — Warren Weckesser, Oct 05 '17 at 06:20
@WarrenWeckesser Could be a hardware thing (CPU, RAM). I am on Intel i7-6700HQ, 16GB RAM. But, yours are reproducible on ideone : https://ideone.com/WDWq9j — Divakar, Oct 05 '17 at 06:31
I'm using a "Late 2013" Macbook Pro, 2.6 GHz Intel Core i7,16 GB 1600 MHz DDR3. Also: Python 3.5.2, numpy 1.13.1. — Warren Weckesser, Oct 05 '17 at 06:39
@Divakar Im using 2015 Macbook Pro, 2.9 GHz Intel Core i5, 8GB 1867MHz DDR3, Python 3.5 with numpy 1.12.1. I get 320ms for `assign_method(a)` and 592ms for `(np.bincount(a.ravel()) !=0).sum()` for 128 uniques. Similar results for 256 uniques. — onepint16oz, Oct 06 '17 at 17:04

fuglede · Answer 3 · 2019-06-29T08:48:59.190

If you don't mind using Numba to JIT compile your code, and modifying your code to make it easy for Numba to do its magic, then it's possible to get some good improvements over the suggestions already listed in the other answers.

Using the naming from @Divakar's post:

from numba import jit
import numpy as np

def create_input(n_unique):
    unq_nums = np.random.choice(np.arange(256), n_unique, replace=0)
    return np.random.choice(unq_nums, (10, 3000, 3000)).astype(np.uint8)

def unique(a):
    return len(np.unique(a))

def assign_method(a):
    q = np.zeros(256, dtype=int)
    q[a.ravel()] = 1
    return len(np.nonzero(q)[0])

def bincount(a):
    return (np.bincount(a.ravel())!=0).sum()

def numba_friendly(a):
    q = np.zeros(256, dtype=int)
    count = 0
    for x in a.ravel():
        if q[x] == 0:
            q[x] = 1
            count += 1
    return count

unique_jit = jit(unique)
assign_method_jit = jit(assign_method)
bincount_jit = jit(bincount)
numba_friendly_jit = jit(numba_friendly)

Benchmark:

a = create_input(n_unique=128)

%timeit unique(a)
%timeit unique_jit(a)
%timeit assign_method(a)
%timeit assign_method_jit(a)
%timeit bincount(a)
%timeit bincount_jit(a)
%timeit numba_friendly_jit(a)

Results:

unique:               7.5 s ± 1.14 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
unique_jit:          13.4 s ± 2.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
assign_method:       388 ms ± 84.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
assign_method_jit:   341 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
bincount:            2.71 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
bincount_jit:        138 ms ± 40.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numba_friendly_jit: 56.4 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can modify the numba method to count the unique elements, too, right? — endolith, Aug 07 '19 at 14:01
Sure; just be aware that you may have to roll your own hash map if you want that as your underlying structure; the example at hand is small enough to fit the counts in an array though. — fuglede, Aug 07 '19 at 14:14

Efficiently counting number of unique elements - NumPy / Python

3 Answers3

Linked