For example, I have two ndarrays: the shape of train_dataset is (10000, 28, 28) and the shape of val_dataset is (2000, 28, 28).

Other than iterating, is there an efficient way to use NumPy's array functions to find the overlap between the two ndarrays?

Vicky

  • Could you explain exactly what you mean by "overlap"? Are you looking for the indices of rows that are found in both `train_dataset` and `val_dataset`? – ali_m Jan 24 '16 at 19:46
  • Right, I want to find the elements (28×28) that appear in both datasets. – Vicky Jan 25 '16 at 01:48
  • If by any chance you're trying to _create_ training and validation datasets, you are much better off using scikit-learn's [cross_validation module](http://scikit-learn.org/stable/modules/cross_validation.html). – Praveen Jan 25 '16 at 19:25

5 Answers


One trick I learnt from Jaime's excellent answer is to use an np.void dtype to view each row of the input arrays as a single element. This lets you treat them as 1D arrays, which can then be passed to np.in1d or one of the other set routines.

import numpy as np

def find_overlap(A, B):

    if A.dtype != B.dtype:
        raise TypeError("A and B must have the same dtype")
    if A.shape[1:] != B.shape[1:]:
        raise ValueError("the shapes of A and B must be identical apart from "
                         "the row dimension")

    # reshape A and B to 2D arrays, forcing a copy if necessary to
    # ensure that they are C-contiguous
    A = np.ascontiguousarray(A.reshape(A.shape[0], -1))
    B = np.ascontiguousarray(B.reshape(B.shape[0], -1))

    # void type that views each row in A and B as a single item
    t = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))

    # use in1d to find rows in A that are also in B
    return np.in1d(A.view(t), B.view(t))

For example:

gen = np.random.RandomState(0)

A = gen.randn(1000, 28, 28)
dupe_idx = gen.choice(A.shape[0], size=200, replace=False)
B = A[dupe_idx]

A_in_B = find_overlap(A, B)

print(np.all(np.where(A_in_B)[0] == np.sort(dupe_idx)))
# True

This method is much more memory-efficient than Divakar's, since it doesn't require broadcasting out to an (m, n, ...) boolean array. In fact, if A and B are row-major then no copying is required at all.
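That last claim is easy to verify with np.shares_memory (a small sanity check added here; it is not part of the original answer):

```python
import numpy as np

A = np.random.randn(10, 28, 28)                      # C-contiguous (row-major)
A2 = np.ascontiguousarray(A.reshape(A.shape[0], -1))
print(np.shares_memory(A, A2))                       # True: just a view, no copy

F = np.asfortranarray(A)                             # column-major layout
F2 = np.ascontiguousarray(F.reshape(F.shape[0], -1))
print(np.shares_memory(F, F2))                       # False: a copy was forced
```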


For comparison, I've slightly adapted Divakar's and B. M.'s solutions.

def divakar(A, B):
    A.shape = A.shape[0], -1
    B.shape = B.shape[0], -1
    return (B[:, None] == A).all(axis=2).any(0)

def bm(A, B):
    t = 'S' + str(A.size // A.shape[0] * A.dtype.itemsize)
    ma = np.frombuffer(np.ascontiguousarray(A), t)
    mb = np.frombuffer(np.ascontiguousarray(B), t)
    return (mb[:, None] == ma).any(0)

Benchmarks:

In [1]: na = 1000; nb = 200; rowshape = 28, 28

In [2]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
divakar(A, B)
   ....: 
1 loops, best of 3: 244 ms per loop

In [3]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
bm(A, B)
   ....: 
100 loops, best of 3: 2.81 ms per loop

In [4]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
find_overlap(A, B)
   ....: 
100 loops, best of 3: 15 ms per loop

As you can see, B. M.'s solution is slightly faster than mine for small n, but np.in1d scales better than testing equality for all elements (O(n log n) rather than O(n²) complexity).

In [5]: na = 10000; nb = 2000; rowshape = 28, 28

In [6]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
bm(A, B)
   ....: 
1 loops, best of 3: 271 ms per loop

In [7]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
find_overlap(A, B)
   ....: 
10 loops, best of 3: 123 ms per loop

Divakar's solution is intractable on my laptop for arrays of this size, since it requires generating a 15 GB intermediate array whereas I only have 8 GB of RAM.

ali_m
  • Thanks this solution is really helpful. I am trying to understand it a little better though. What is the reason for using `np.ascontiguousarray(A.reshape(A.shape[0], -1))` instead of `np.array([x.flatten() for x in A])` which I have seen other code use to do something similar? Is it a style thing, or do they do something different? – Barker Sep 05 '17 at 17:24
  • @Barker Try timing those two lines for a reasonably large input array. Firstly, the list comprehension will almost certainly be slower than a single call to `reshape`, especially if there are lots of rows in `A`. Also, `.flatten()` always returns a copy (whereas `.reshape()` and `.ravel()` only copy if necessary), so you're making a temporary copy of each row of `A`, then another copy when you call `np.array(...)` on the list. `np.ascontiguousarray` only returns a copy if necessary, so my line creates at most a single copy of `A`. – ali_m Sep 05 '17 at 22:55
  • This is the only solution (on this page) that works well for larger data sets. It should have more upvotes! – Per Quested Aronsson Jan 19 '18 at 17:17

Memory permitting, you could use broadcasting, like so:

val_dataset[(train_dataset[:,None] == val_dataset).all(axis=(2,3)).any(0)]

Sample run:

In [55]: train_dataset
Out[55]: 
array([[[1, 1],
        [1, 1]],

       [[1, 0],
        [0, 0]],

       [[0, 0],
        [0, 1]],

       [[0, 1],
        [0, 0]],

       [[1, 1],
        [1, 0]]])

In [56]: val_dataset
Out[56]: 
array([[[0, 1],
        [1, 0]],

       [[1, 1],
        [1, 1]],

       [[0, 0],
        [0, 1]]])

In [57]: val_dataset[(train_dataset[:,None] == val_dataset).all(axis=(2,3)).any(0)]
Out[57]: 
array([[[1, 1],
        [1, 1]],

       [[0, 0],
        [0, 1]]])

If the elements are integers, you could collapse each block along axes (1, 2) into a single scalar by treating its entries as linearly indexable digits, and then efficiently use np.in1d or np.intersect1d to find the matches.
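A minimal sketch of that integer-collapsing idea (the base/weights construction below is my own illustration, not from the original answer; it assumes non-negative integers smaller than the chosen base):

```python
import numpy as np

train_dataset = np.array([[[1, 1], [1, 1]],
                          [[1, 0], [0, 0]],
                          [[0, 0], [0, 1]],
                          [[0, 1], [0, 0]],
                          [[1, 1], [1, 0]]])
val_dataset = np.array([[[0, 1], [1, 0]],
                        [[1, 1], [1, 1]],
                        [[0, 0], [0, 1]]])

# treat the four entries of each 2x2 block as digits in base `base`,
# so each distinct block collapses to a distinct scalar id
base = max(train_dataset.max(), val_dataset.max()) + 1
weights = base ** np.arange(train_dataset[0].size)

train_ids = train_dataset.reshape(len(train_dataset), -1) @ weights
val_ids = val_dataset.reshape(len(val_dataset), -1) @ weights

# set membership on scalars instead of comparing full blocks
print(val_dataset[np.in1d(val_ids, train_ids)])
```

This recovers the same two blocks as the broadcasting sample run above.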

Divakar

Full broadcasting would generate a 10000 × 2000 × 28 × 28 ≈ 15.7 GB boolean array here.

For efficiency, you can:

  • pack the data, so each (28, 28) image becomes one fixed-width byte string and the pairwise comparison array shrinks to 10000 × 2000 ≈ 20 MB:

    from pylab import *
    N = 10000
    a = rand(N, 28, 28)
    b = a[randint(0, N, N//5)]

    packedtype = 'S' + str(a.size // a.shape[0] * a.dtype.itemsize)  # 'S6272'
    ma = frombuffer(a, packedtype)  # ma.shape == (10000,)
    mb = frombuffer(b, packedtype)  # mb.shape == (2000,)

    %timeit a[:,None] == b    # 102 s
    %timeit ma[:,None] == mb  # 800 ms
    allclose((a[:,None] == b).all((2,3)), ma[:,None] == mb)  # True


    The packed version is also helped by lazy string comparison, which stops at the first differing byte:

    In [31]: %timeit a[:100]==b[:100]
    10000 loops, best of 3: 175 µs per loop
    
    In [32]: %timeit a[:100]==a[:100]
    10000 loops, best of 3: 133 µs per loop
    
    In [34]: %timeit ma[:100]==mb[:100]
    100000 loops, best of 3: 7.55 µs per loop
    
    In [35]: %timeit ma[:100]==ma[:100]
    10000 loops, best of 3: 156 µs per loop
    

The matching indices are then obtained with (ma[:,None] == mb).nonzero().

  • use in1d, for O((Na+Nb) log(Na+Nb)) complexity, against O(Na·Nb) for the full comparison:

    %timeit in1d(ma, mb).nonzero()  # 590 ms


Not a big gain here, but asymptotically better.
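Putting the packing and in1d steps together, a self-contained sketch of this approach (the array sizes here are illustrative, smaller than the benchmark above):

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(1000, 28, 28)
b = a[rng.choice(1000, 200, replace=False)]   # 200 rows duplicated from a

# pack each (28, 28) row into one fixed-width byte string
packedtype = 'S' + str(a.size // a.shape[0] * a.dtype.itemsize)   # 'S6272'
ma = np.frombuffer(np.ascontiguousarray(a), packedtype)
mb = np.frombuffer(np.ascontiguousarray(b), packedtype)

# sort-based set membership: True where a row of a also occurs in b
mask = np.in1d(ma, mb)
print(mask.sum())   # 200
```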

B. M.
  • Great! Never knew short-circuiting could be achieved with strings like this. But maybe add a `ascontiguousarray` in there, so it works properly for arrays like `a = rand(28,28,N).T`. –  Jan 25 '16 at 22:16
  • @morningsun: you are right. In this case I just created the array, so it is contiguous, but for external data `ma = frombuffer(ascontiguousarray(a), packedtype)` is safer. Thanks. – B. M. Jan 28 '16 at 21:51

Solution

def overlap(a, b):
    """
    Returns a boolean index array for input array b marking the
    elements of b that are also found in a.
    """
    aa = a.repeat(b.shape[0], axis=0)     # a0 a0 a0 a1 a1 a1 ...
    bb = tile(b, (a.shape[0], 1, 1))      # b0 b1 b2 b0 b1 b2 ...
    c = (aa == bb).all(axis=(1, 2))       # one flag per (a_i, b_j) pair
    return c.reshape(a.shape[0], b.shape[0]).any(axis=0)

You can use the returned index array to index b and extract the elements that are also found in a

b[overlap(a, b)]

Explanation

For simplicity's sake I assume you have imported everything from numpy for this example:

from numpy import *

So, for example, given two ndarrays

a = arange(4*2*2).reshape(4,2,2)
b = arange(3*2*2).reshape(3,2,2)

we repeat a and tile b so that every pairing of a sub-array from a with a sub-array from b occurs exactly once

aa = a.repeat(b.shape[0], axis=0)
bb = tile(b, (a.shape[0], 1, 1))

we can then compare aa and bb element-wise and collapse each 2×2 comparison into a single flag

c = (aa == bb).all(axis=(1, 2))

Finally, we reshape c so that it has one row per element of a and one column per element of b; a column that contains any True marks an element of b that also occurs in a

c.reshape(a.shape[0], b.shape[0]).any(axis=0)

In our example we get

array([True,  True,  True], dtype=bool)

To check, change the first element of b

b[0] = array([[50,60],[70,80]])

and we get

array([False,  True,  True], dtype=bool)

Fabian Rost
Jan Christoph Terasa

Does this question come from Google's online deep learning course? The following is my solution:

n_overlap = 0  # number of rows of val_dataset that also appear in train_dataset
for i in range(val_dataset.shape[0]):  # iterate over all rows of val_dataset
    if (train_dataset == val_dataset[i]).all(axis=(1, 2)).any():
        n_overlap += 1
print(n_overlap)

Broadcasting replaces the inner loop over train_dataset; you may test the performance difference against full iteration.

D-K