
I have many large 1D arrays and I'd like to grab the unique values. Typically, one could do:

x = np.random.randint(10000, size=100000000)
np.unique(x)

However, this performs an unnecessary sort of the array. The docs for np.unique do not mention any way to retrieve the unique values without sorting. Other answers with np.unique use return_index but, as I understand it, the array still gets sorted. So, I tried using set:

set(x)

But this is way slower than sorting the array with np.unique. Is there a faster way to retrieve the unique values for this array that avoids sorting and is faster than np.unique?

slaw
  • You can use pandas: `pd.Series(x).unique()`. Seems a bit faster. – Divakar Feb 12 '20 at 21:02
  • Sorting is needed to efficiently check for duplicates, especially when the arrays become larger, so I think the most efficient algorithm includes sorting 'under the hood'. – BramAppel Feb 12 '20 at 21:08
  • @Divakar I was hoping to keep this in NumPy-land in order to avoid adding an additional package dependency since I expect to open source the code. I want to make sure that the juice is worth the squeeze – slaw Feb 12 '20 at 21:10
  • pandas is just as *open* as NumPy. – Divakar Feb 12 '20 at 21:11
  • @BramAppel If the output order is not important, then I would expect this to be an O(`n`) time complexity since we can stuff the results into, say, a dictionary. Right now, with `np.unique`, it would be O(`n*log(n)`) – slaw Feb 12 '20 at 21:12
  • @Divakar the emphasis is on `dependency` :) But your original point is well taken! I was just surprised to see that there isn't an option to get unique values without sorting and assumed that I was overlooking something. – slaw Feb 12 '20 at 21:13
  • @slaw I see. The only difference is that np.unique keeps the first duplicate while the dict approach keeps the last, which makes no difference for the result, of course. – BramAppel Feb 12 '20 at 21:17
  • 1
    `unique` works by sorting, and then looking for adjacent matching values. Whether you ask for the index or not, it doesn't change the basic mechanism. `set` uses Python's hashing (which is also used for `dict`). Is there some other, more efficient, approach? – hpaulj Feb 12 '20 at 21:51
  • @hpaulj in my timing tests, `pd.Series(x).unique()` is around 5x faster than both `np.unique(x)` and `set(x)` where `x.shape[0]` is `1_000_000_000` in length – slaw Feb 13 '20 at 03:03
  • And the `pd` `unique` docs say it's hash based (and undoubtedly compiled). – hpaulj Feb 13 '20 at 03:33
  • Yeah, I have a `numba.njit` function somewhere around here that does this, and it's quite a bit faster than `np.unique` - but OP doesn't want extra dependencies – Daniel F Feb 13 '20 at 09:45
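The hash-based, O(n) idea discussed in the comments can be sketched in pure Python (far too slow at this scale, but it illustrates the complexity argument; `dict.fromkeys` preserves first-occurrence order on Python 3.7+, which relates to BramAppel's point about which duplicate is kept):

```python
import numpy as np

x = np.array([3, 1, 3, 2, 1])

# One pass, no sort: dict keys keep first-occurrence order
u = np.fromiter(dict.fromkeys(x.tolist()), dtype=x.dtype)
print(u)  # [3 1 2]
```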

2 Answers


If your values are non-negative integers in a relatively small range (e.g. 0 ... 10000), there is an alternative way to obtain the unique values using a mask (see unique2() below):

import numpy as np

def unique1(x):
    return np.unique(x)

def unique2(x):
    maxVal    = np.max(x)+1
    values    = np.arange(maxVal)
    used      = np.zeros(maxVal)    # mark which values occur in x
    used[x]   = 1
    return values[used==1]

# optimized (with option to provide known value range)
def unique3(x,maxVal=None):
    maxVal    = maxVal or np.max(x)+1
    used      = np.zeros(maxVal,dtype=np.uint8)  # one byte per candidate value
    used[x]   = 1
    return np.argwhere(used==1)[:,0]
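For example, on a small illustrative array (note the output comes out in ascending order simply because the mask is scanned by index, not because anything was sorted):

```python
import numpy as np

x = np.array([7, 3, 7, 0, 3, 9])
used = np.zeros(x.max() + 1, dtype=np.uint8)
used[x] = 1                              # flag every value that occurs
result = np.argwhere(used == 1)[:, 0]    # indices of the flagged slots
print(result)  # [0 3 7 9]
```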

In my tests this method is a lot faster than np.unique and it does not involve sorting:

from timeit import timeit
count = 3
x = np.random.randint(10000, size=100000000)

t = timeit(lambda:unique1(x),number=count)
print("unique1",t)

t = timeit(lambda:unique2(x),number=count)
print("unique2",t)

t = timeit(lambda:unique3(x),number=count)
print("unique3",t)

t = timeit(lambda:unique3(x,10000),number=count)
print("unique3",t, "with known value range")


# unique1 16.894681214000002
# unique2 0.8627655060000023
# unique3 0.8411087540000004
# unique3 0.5896318829999991 with known value range
Alain T.
  • You should create the zeros with `dtype=uint8`. IIRC `dtype=bool` doesn't do anything useful. – o11c Feb 12 '20 at 23:29
  • Note also that `gmpy2.xmpz` can be used as a bitvector for more memory efficiency (and it also avoids the temporaries), but it doesn't allow the `values[array]` trick, so it will cost a lot more CPU unless you're using something that JITs your loops. – o11c Feb 12 '20 at 23:30
  • I merely wanted to point out an alternative method with orders of magnitude improvements in speed. I agree that it can be further improved by a few % with low-level optimizations (e.g. using uint8 shaves off 3%). – Alain T. Feb 12 '20 at 23:36
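Following up on o11c's dtype comment, a boolean-mask variant of the same idea can be sketched as below (`np.flatnonzero` replaces the `argwhere(...)[:,0]` step; note NumPy booleans occupy one byte each, just like uint8, so the memory footprint is the same):

```python
import numpy as np

def unique_bool(x, maxVal=None):
    # same mask idea as unique3, with a boolean mask and flatnonzero
    maxVal = maxVal or np.max(x) + 1
    used = np.zeros(maxVal, dtype=bool)
    used[x] = True
    return np.flatnonzero(used)

x = np.random.randint(10000, size=100000)
assert np.array_equal(unique_bool(x), np.unique(x))
```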

Just in case you change your mind about dependencies, here's a dirt simple numba.njit implementation:

import numba
import numpy as np

@numba.njit
def unique(arr):
    return np.array(list(set(arr)))


%timeit unique(x)  # using Alain T.'s benchmark array
2.64 s ± 799 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.unique(x)
5.45 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not as lightning fast as the mask approach above, but it doesn't require non-negative integer inputs, either.

Daniel F