
I have searched around a bit for tutorials etc. to help with this problem but can't seem to find anything.

I have two lists of n-dimensional numpy arrays (3D array form of some images) and want to check for overlapping images between the two lists. Let's say list a is a training set and list b is a validation set. One solution is just to use a nested loop and check whether each pair of arrays is equal using np.array_equal(a[i], b[j]), but that is slow (each list has about 200,000 numpy arrays in it) and frankly quite disgusting.

I was thinking a more elegant way of achieving this would be to hash each of the numpy arrays within each list and then compare each entry using these hash tables.
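
Something like this rough sketch is what I have in mind (not working code, just the idea: hash each array's raw bytes and use the hashes as dictionary keys):

train_hashes = {hash(arr.tobytes()): i for i, arr in enumerate(train_dataset)}
overlap = [j for j, arr in enumerate(valid_dataset)
           if hash(arr.tobytes()) in train_hashes]
# hash collisions would still need a full element-wise comparison to confirm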

Firstly, is this solution correct, and secondly, how would I go about achieving this?
An example of some of the data is below.

train_dataset[:3]
array([[[-0.5       , -0.49607843, -0.5       , ..., -0.5       ,
         -0.49215686, -0.5       ],
        [-0.49607843, -0.47647059, -0.5       , ..., -0.5       ,
         -0.47254902, -0.49607843],
        [-0.49607843, -0.49607843, -0.5       , ..., -0.5       ,
         -0.49607843, -0.49607843],
        ..., 
        [-0.49607843, -0.49215686, -0.5       , ..., -0.5       ,
         -0.49215686, -0.49607843],
        [-0.49607843, -0.47647059, -0.5       , ..., -0.5       ,
         -0.47254902, -0.49607843],
        [-0.5       , -0.49607843, -0.5       , ..., -0.5       ,
         -0.49607843, -0.5       ]],

       [[-0.5       , -0.5       , -0.5       , ...,  0.48823529,
          0.5       ,  0.1509804 ],
        [-0.5       , -0.5       , -0.5       , ...,  0.48431373,
          0.14705883, -0.32745099],
        [-0.5       , -0.5       , -0.5       , ..., -0.32745099,
         -0.5       , -0.49607843],
        ..., 
        [-0.5       , -0.44901961,  0.1509804 , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.49607843, -0.49607843, -0.49215686, ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.49607843, -0.48823529, ..., -0.5       ,
         -0.5       , -0.5       ]],

       [[-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.49607843, ..., -0.5       ,
         -0.5       , -0.5       ],
        ..., 
        [-0.5       , -0.5       , -0.5       , ..., -0.48823529,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ]]], dtype=float32)

I appreciate the help in advance.

Michael Hall
  • Show us a bit of what you've done with the pairwise comparison. I don't know what you mean by `disgusting` in this context. – hpaulj Sep 25 '16 at 00:30
  • See recent http://stackoverflow.com/questions/39674863/python-alternative-for-using-numpy-array-as-key-in-dictionary. Also look at questions about unique or sorted rows. – hpaulj Sep 25 '16 at 02:42
  • basically i was trying this: `duplicates = [] for i in train_dataset: for j in valid_dataset: duplicates.append(np.equal(i,j))` Sorry about the formatting, these comments are weird. – Michael Hall Sep 25 '16 at 03:40
  • How does that `np.equal` perform with floats? Usually we recommend `np.allclose` when testing float arrays against each other. (or `np.isclose`) – hpaulj Sep 25 '16 at 03:58
  • obviously not very good because I let it run for about 30 minutes and it hadn't finished haha. I'll give those two options a try now. – Michael Hall Sep 25 '16 at 04:06
  • That is going to take an extremely long time by the looks of it (I just started running it and got it to print updates). – Michael Hall Sep 25 '16 at 04:18

3 Answers


The numpy_indexed package (disclaimer: I am its author) has an efficient one-liner for this:

import numpy_indexed as npi
duplicate_images = npi.intersection(train_dataset, test_dataset) 

Plus a lot of related functions which you might find useful in this context.
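
For example, a minimal sketch (assuming both datasets are single ndarrays with one image per entry along the first axis, as in the question's data):

import numpy as np
import numpy_indexed as npi

# stack the lists of image arrays along a leading axis, if they aren't already
train = np.asarray(train_dataset)
test = np.asarray(test_dataset)

# images whose entire contents appear in both sets
duplicate_images = npi.intersection(train, test)
print(len(duplicate_images), 'overlapping images')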

Eelco Hoogendoorn
  • Is there a way to make this work with separate `X` and `y` (e.g. `X_train`, `y_train`, `X_test`, `y_test`)? I thought about returning indices like `np.intersect1d` allows and then check for the intersection in a second pass, but this is unfortunately not (yet?) supported by `numpy_indexed`. – Felix Oct 07 '22 at 12:33
  • If I understand you correctly, simply passing in tuples, ie npi.intersection((X_train, y_train), (X_test, y_test)) should accomplish your goal. Perhaps not the best documented feature, but look for LexIndex in the source. – Eelco Hoogendoorn Oct 07 '22 at 18:43

It's not so difficult to come up with something:

from collections import defaultdict
import numpy as np

def arrayhash(arr):
    # reinterpret the array's bytes as unsigned integers and XOR them together
    u = arr.view('u' + str(arr.itemsize))
    return np.bitwise_xor.reduce(u.ravel())

def do_the_thing(a, b, hashfunc=arrayhash):
    # bucket the indices of the arrays in a by their hash value
    table = defaultdict(list)
    for i, a_i in enumerate(a):
        table[hashfunc(a_i)].append(i)

    # for each array in b, do a full comparison only against the
    # arrays in a that landed in the same bucket
    indices = []
    for j, b_j in enumerate(b):
        candidates = table.get(hashfunc(b_j), [])
        for i in candidates:
            if np.array_equiv(a[i], b_j):
                indices.append((i, j))

    return indices
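
For example, a hypothetical usage sketch (assuming, as in the question, that both lists hold equally-shaped float32 image arrays):

pairs = do_the_thing(train_dataset, valid_dataset)
# pairs holds (i, j) index tuples where train_dataset[i] equals valid_dataset[j]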

But note:

  • Checking for floating-point equality is often a bad idea, because of limited precision and rounding error. A famous example:

    >>> 0.1 + 0.2 == 0.3
    False
    
  • NaNs don't compare equal to themselves:

    >>> np.nan == np.nan
    False
    
  • The simple hash function above operates on the bit representation of the floats, but this is problematic in the presence of negative zeros and signaling NaNs, as the sketch below shows.
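
    A quick illustration of the negative-zero caveat, reusing arrayhash from above: 0.0 and -0.0 compare equal as floats, but their bit patterns differ, so two equal arrays can land in different hash buckets and the candidate lookup will miss them:

    >>> a = np.array([0.0])
    >>> b = np.array([-0.0])
    >>> np.array_equiv(a, b)
    True
    >>> arrayhash(a) == arrayhash(b)
    False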

See also the discussion in this question: Good way to hash a float vector?

user6758673

You can find duplicates between arrays using numpy's intersect1d (one-dimensional set intersection) function.

duplicate_images = np.intersect1d(train_dataset, test_dataset) 

I timed this using the train and test sets (55,000 and 10,000 arrays respectively) from one of the tensorflow tutorials, which I'm guessing are similar to your data. Using intersect1d, it took about 2.4 seconds to complete on my machine (only 1.3 seconds with the parameter assume_unique=True). A pairwise comparison like you describe took several minutes.

EDIT

This answer (above) does not compare each "image" array; as @mbhall88 points out in the comments, it compares elements within the arrays, not the arrays themselves. To make sure you compare whole arrays, you can still use intersect1d, but you have to mess around with the dtypes first, as explained here. The example in that answer deals with 2d arrays, though, and since you're working with 3d arrays you should first flatten the last two dimensions. You should be able to do something like:

import numpy as np

def intersect3d(A, B, assume_unique=False):
    # get the original shape of your arrays
    a1d, a2d, a3d = A.shape
    # number of values per image once the last two dimensions are flattened
    ncols = a2d * a3d
    # flatten the 2nd and 3rd dimensions so each image becomes a single row
    A = np.ascontiguousarray(A).reshape((a1d, ncols))
    B = np.ascontiguousarray(B).reshape((len(B), ncols))
    # define a structured dtype so each row is treated as a single "element"
    dtype = ', '.join([str(A.dtype)] * ncols)
    # find the duplicate rows (images)
    C = np.intersect1d(A.view(dtype), B.view(dtype), assume_unique=assume_unique)
    # view the matches as plain values again and restore the image shape
    return C.view(A.dtype).reshape((len(C), a2d, a3d))
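
For example, a hypothetical usage sketch (assuming both datasets are single 3d ndarrays, as in the question):

duplicate_images = intersect3d(train_dataset, test_dataset)
print(len(duplicate_images), 'overlapping images')
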
bunji
  • The issue with this approach is that it finds elements *within* the arrays that are the same. I am interested in which *arrays* are the same, i.e. all elements within the array are exactly the same, not just one element. – Michael Hall Sep 25 '16 at 03:39
  • You are absolutely right. I believe that [this answer](http://stackoverflow.com/a/8317403/5992438) shows how to use `intersect1d` in the way that you need. I will update the answer with a link as well. – bunji Sep 25 '16 at 14:40