
I want to determine, as quickly as possible, whether my list (actually a numpy.ndarray) contains any duplicates. Note that I don't care about removing the duplicates; I simply want to know whether there are any.

Note: I'd be extremely surprised if this is not a duplicate, but I've tried my best and can't find one. Closest are this question and this question, both of which are requesting that the unique list be returned.

Scott Gigante
  • Do you mean "quickly" in terms of writing the code quickly (and, presumably, understanding it quickly when you read it six months from now), or executing quickly? – abarnert Jun 15 '18 at 22:58
  • Also, if false positives are acceptable, you may want to use a [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) (I'm pretty sure there are good numpy-based implementations out there). Even if false positives _aren't_ acceptable, it may be faster (and will certainly take less memory) to first run a bloom filter, then check the uniqueness of the possibles with a set or `np.unique`. – abarnert Jun 15 '18 at 23:15
  • My question was specifically about execution speed, but of course incomprehensible code is nobody's best friend. The bloom filter is a very interesting approach! – Scott Gigante Jun 15 '18 at 23:21
  • To put alternatives in perspective, `np.unique` does a `sort`, and looks for pairwise differences. It's likely to be the best approach if duplicates are so rare that short-circuiting approaches end up scanning most of the array anyway. – hpaulj Jun 16 '18 at 00:18 (see the sort-based sketch just after these comments)
  • What type of values are in your array? I.e., what is `arr.dtype`? – Warren Weckesser Jun 16 '18 at 02:05
  • @warren I didn't even think about that. My array contains 30000 `np.str`s of length <20. I suppose this might alter performance considerably. – Scott Gigante Jun 17 '18 at 01:09
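
hpaulj's sort-based suggestion amounts to something like the following (the function name is just illustrative): sort the array, then test whether any adjacent pair is equal. It stays entirely in numpy, so there's no per-element Python loop, and it works for string arrays like the one described in the last comment as well.

import numpy as np

def contains_duplicates_sorted(X):
    x = np.sort(X)                        # O(n log n), like the sort inside np.unique
    return bool((x[1:] == x[:-1]).any())  # any equal neighbours means a duplicate exists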

3 Answers


Here are the four ways I thought of doing it.

TL;DR: if you expect very few (less than 1/1000) duplicates:

def contains_duplicates(X):
    return len(np.unique(X)) != len(X)

If you expect frequent (more than 1/1000) duplicates:

def contains_duplicates(X):
    seen = set()
    seen_add = seen.add
    for x in X:
        if (x in seen or seen_add(x)):
            return True
    return False

The first method is an early exit from this answer, which wants to return the unique values, and the second is the same idea applied to this answer.

>>> import numpy as np
>>> X = np.random.normal(0,1,[10000])
>>> def terhorst_early_exit(X):
...:     elems = set()
...:     for i in X:
...:         if i in elems:
...:             return True
...:         elems.add(i)
...:     return False
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 10.6 ms per loop
>>> def peterbe_early_exit(X):
...:     seen = set()
...:     seen_add = seen.add
...:     for x in X:
...:         if (x in seen or seen_add(x)):
...:             return True
...:     return False
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 9.35 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 4.54 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 967 µs per loop

Do things change if you start with an ordinary Python list, and not a numpy.ndarray?

>>> X = X.tolist()
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 9.34 ms per loop
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 8.07 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 1.83 ms per loop

Edit: what if we have a prior expectation of the number of duplicates?

The above comparison assumes that either a) there are likely to be no duplicates, or b) we're more worried about the worst case than the average case.

>>> X = np.random.normal(0, 1, [10000])
>>> for n_duplicates in [1, 10, 100]:
>>>     print("{} duplicates".format(n_duplicates))
>>>     duplicate_idx = np.random.choice(len(X), n_duplicates, replace=False)
>>>     X[duplicate_idx] = 0
>>>     print("terhost_early_exit")
>>>     %timeit terhorst_early_exit(X)
>>>     print("peterbe_early_exit")
>>>     %timeit peterbe_early_exit(X)
>>>     print("set length")
>>>     %timeit len(set(X)) != len(X)
>>>     print("numpy unique length")
>>>     %timeit len(np.unique(X)) != len(X)
1 duplicates
terhorst_early_exit
100 loops, best of 3: 12.3 ms per loop
peterbe_early_exit
100 loops, best of 3: 9.55 ms per loop
set length
100 loops, best of 3: 4.71 ms per loop
numpy unique length
1000 loops, best of 3: 1.31 ms per loop
10 duplicates
terhorst_early_exit
1000 loops, best of 3: 1.81 ms per loop
peterbe_early_exit
1000 loops, best of 3: 1.47 ms per loop
set length
100 loops, best of 3: 5.44 ms per loop
numpy unique length
1000 loops, best of 3: 1.37 ms per loop
100 duplicates
terhorst_early_exit
10000 loops, best of 3: 111 µs per loop
peterbe_early_exit
10000 loops, best of 3: 99 µs per loop
set length
100 loops, best of 3: 5.16 ms per loop
numpy unique length
1000 loops, best of 3: 1.19 ms per loop

So if you expect very few duplicates, the numpy.unique function is the way to go. As the number of expected duplicates increases, the early exit methods dominate.
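
If you don't want to pick by hand, the two winners above can be wrapped in one helper that dispatches on your prior expectation. This is just a convenience sketch; the expect_duplicates flag and the wrapper itself weren't part of the benchmarks above.

import numpy as np

def contains_duplicates(X, expect_duplicates=False):
    if not expect_duplicates:
        # Duplicates are rare: let numpy sort once and compare lengths.
        return len(np.unique(X)) != len(X)
    # Duplicates are common: a short-circuiting set pays off.
    seen = set()
    seen_add = seen.add
    for x in X:
        if x in seen or seen_add(x):
            return True
    return False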

Scott Gigante
  • I'm not sure this is a fair test. Duplicates are going to be incredibly unlikely with your data, and therefore early exit is pointless. If the real-life data set will typically have, say, 10 duplicates, early exit will make it about 10x as fast. – abarnert Jun 15 '18 at 23:02
  • Very good point. I've added a discussion of expected number of duplicates and indeed the early exit method dominates as the number of duplicates increases. – Scott Gigante Jun 15 '18 at 23:22
  • Another possible thing to worry about is whether memory allocation is an issue. For huge inputs, whether you build a set of zillions of floats, or let `unique` return an array of zillions of floats plus use whatever temporary storage it needs, that could make a huge difference. But I wrote an answer on how to avoid that if it's an issue, because it's too much to put in a comment. – abarnert Jun 15 '18 at 23:59

Depending on how large your array is, and how likely duplicates are, the answer will be different.

For example, if you expect the average array to have around 3 duplicates, early exit will cut your average-case time (and space) by 2/3rds; if you expect only 1 in 1000 arrays to have any duplicates at all, it will just add a bit of complexity without improving anything.

Meanwhile, if the arrays are big enough that building a temporary set as large as the array is likely to be expensive, sticking a probabilistic test like a bloom filter in front of it will probably speed things up dramatically, but if not, it's again just wasted effort.

Finally, you want to stay within numpy if at all possible. Looping over an array of floats (or whatever) and boxing each one into a Python object is going to take almost as much time as hashing and checking the values, and of course storing things in a Python set instead of optimized numpy storage is wasteful as well. But you have to trade that off against the other issues: you can't do early exit with numpy, and there may be nice C-optimized bloom filter implementations a pip install away, but none that are numpy-friendly.

So, there's no one best solution for all possible scenarios.
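
As a rough illustration of that early-exit-versus-numpy trade-off, here's one possible compromise (a sketch I haven't timed; the chunk_size is arbitrary): process the array in numpy-sized chunks, so most of the work stays vectorized but you can still bail out as soon as a chunk reveals a duplicate.

import numpy as np

def contains_duplicates_chunked(X, chunk_size=1000):
    X = np.asarray(X)
    seen = np.empty(0, dtype=X.dtype)   # everything from earlier chunks
    for start in range(0, len(X), chunk_size):
        chunk = X[start:start + chunk_size]
        if len(np.unique(chunk)) != len(chunk):   # duplicate within this chunk
            return True
        if np.isin(chunk, seen).any():            # duplicate against an earlier chunk
            return True
        seen = np.concatenate([seen, chunk])
    return False

Whether this beats a single np.unique call depends entirely on how early the first duplicate shows up, which is exactly the point above.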


Just to give an idea of how easy it is to write a bloom filter, here's one I hacked together in a couple minutes:

from bitarray import bitarray # pip3 install bitarray

def dupcheck(X):
    # Hardcoded values to give about 5% false positives for 10000 elements
    size = 62352
    hashcount = 4
    bits = bitarray(size)
    bits.setall(0)
    def check(x, hash=hash): # TODO: default-value bits, hashcount, size?
        for i in range(hashcount):
            if not bits[hash((x, i)) % size]: return False
        return True
    def add(x):
        for i in range(hashcount):
            bits[hash((x, i)) % size] = True
    # Pass 1: the bloom filter flags every value that *might* have been seen
    # before (all true repeats, plus ~5% false positives).
    candidates = set()
    for x in X:
        if check(x):
            if x in candidates:   # an equal value was already flagged: definite duplicate
                return True
            candidates.add(x)
        else:
            add(x)
    if not candidates:
        return False
    # Pass 2: a value that appears exactly twice is only flagged on its second
    # occurrence, so finish with an exact check restricted to the flagged values.
    seen = set()
    for x in X:
        if x in candidates:
            if x in seen:
                return True
            seen.add(x)
    return False

This only uses roughly 12KB (a 62352-bit bitarray plus a couple of small sets holding the ~500 flagged values) instead of 80KB (a 10000-float set or np.array). That doesn't matter when you're only dealing with 10K elements, but with, say, 10B elements that use up more than half of your physical RAM, it would be a different story.

Of course it's almost certainly going to be an order of magnitude or so slower than using np.unique, or maybe even set, because we're doing all that slow looping in Python. But if this turns out to be worth doing, it should be a breeze to rewrite in Cython (and to directly access the numpy array without boxing and unboxing).

abarnert

My timing tests differ from Scott's for small lists. Using Python 3.7.3, set() is much faster than np.unique for a small numpy array from randint (length 8), but np.unique is faster for a larger array (length 1000).
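
For reference, set_len and unique_len are essentially just the length comparisons from Scott's answer wrapped for the timing harness, roughly:

import numpy as np

# Rough reconstructions of the timed functions (names taken from the tables
# below); **kwargs just absorbs whatever the timing harness passes in.
def set_len(key_rand, **kwargs):
    return len(set(key_rand)) != len(key_rand)

def unique_len(key_rand, **kwargs):
    return len(np.unique(key_rand)) != len(key_rand)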

Length 8

Timing test iterations: 10000
Function          Min      Avg Sec  Conclusion      p-value
----------  ---------  -----------  ------------  ---------
set_len     0          7.73486e-06  Baseline
unique_len  9.644e-06  2.55573e-05  Slower                0

Length 1000

Timing test iterations: 10000
Function           Min      Avg Sec  Conclusion      p-value
----------  ----------  -----------  ------------  ---------
set_len     0.00011066  0.000270466  Baseline
unique_len  4.3684e-05  8.95608e-05  Faster                0

Then I tried my own implementation, but I think it would require optimized C code to beat set:

def check_items(key_rand, **kwargs):
    # O(n^2) brute force: compare each element against every later element
    for i, vali in enumerate(key_rand):
        for j in range(i + 1, len(key_rand)):
            if vali == key_rand[j]:
                return True
    return False

Length 8

Timing test iterations: 10000
Function            Min      Avg Sec  Conclusion      p-value
-----------  ----------  -----------  ------------  ---------
set_len      0           6.74221e-06  Baseline
unique_len   0           2.14604e-05  Slower                0
check_items  1.1138e-05  2.16369e-05  Slower                0

(using my randomized compare_time() function from easyinfo)

Joshua M