np.where checking also for subelements in multidimensional arrays

Question

I have two multidimensional arrays with the same second dimension. I want to be sure no element (i.e., no row) of the first array is also a row of the second array.

To do this I am using numpy.where, but its behaviour is also checking for sub-elements in the same position. For example consider this code:

x = np.array([[0,1,2,3], [4,0,6,9]])
z= np.array([[0,1,2,3], [5, 11, 6,98]])
for el in x:
    print(np.where(z==el))

It prints:

(array([0, 0, 0, 0]), array([0, 1, 2, 3]))
(array([1]), array([2]))

where the first result is due to the first arrays being equal, the second is because the both z[1] and x[1] have 6 as third element. Is there a way to tell np.where to return only indexes of strictly equal elements, i.e. 0 in the example above?

mujjiga · Answer 1 · 2020-11-18T10:19:55.717

2

[i for i, e in enumerate(x) if (e == z).all(1).any()]

Test case:

x = np.array([[0,1,2,3], [4,0,6,9], [4,0,6,19]])
z= np.array([[4,0,6,9], [0,1,2,3]])

[i for i, e in enumerate(x) if (e == z).all(1).any()]

Output:

[0, 1]

edited Nov 18 '20 at 10:19

answered Nov 18 '20 at 09:43

mujjiga

16,186
2
33
51

Using vectorized approach will speed this up dramatically - see my answer – orgoro Nov 18 '20 at 10:50
Actually in my test your answer is only about a 10% speedup over this. And my void view method is about 60x faster than either. – Daniel F Nov 18 '20 at 11:01
This is a good answer and I like the simplicity, getting also the indexes of the first vector would be even better. – roschach Nov 18 '20 at 11:10

orgoro · Answer 2 · 2020-11-18T10:53:08.623

Where simply returns the indices of your condition - here it's element wise equal

Answer

You can find the duplicates using vectorized operations:

duplicates = (x[:, None] == z).all(-1).any(-1)

Get Values

To get the duplicates values use masking

x[duplicates]

in this example:

duplicates = [True False]

x[duplicates] = [[0, 1, 2, 3]]

Logic

expanding the array [:, None]
find only full row matches all(-1)
return rows that have at least one match any(-1)

Daniel F · Accepted Answer · 2020-11-18T11:05:14.593

Man I haven't had a chance to link to this answer since np.unique added an axis parameter. Credit to @Jaime

vview = lambda a: np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))

Basically, that takes the "rows" of your matrix and turns them into a 1-d array of views on the raw datastream of the rows. This lets you compare rows as if they were single values.

Then it's fairly simple:

print(np.where(vview(x) == vview(z).T))
(array([0], dtype=int64), array([0], dtype=int64))

Representing that the 1st row of x matches the first row of z

If you only want to know if rows of x are in rows of z:

print(np.where(np.isin(vview(x), vview(z)).squeeze()))
(array([0], dtype=int64),)

Checking times compared to @mujjiga on big arrays:

x = np.random.randint(10, size = (1000, 4))

z = np.random.randint(10, size = (1000, 4))

%timeit np.where(np.isin(vview(x), vview(z)).squeeze())
365 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit [i for i, e in enumerate(x) if (e == z).all(1).any()]  # @mujjiga
21.3 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.where((x[:, None] == z).all(-1).any(-1))  # @orgoro
20 ms ± 767 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So about a 60x speedup over looping and slicing, probably due to quick short-circuiting and only comparing 1/4 the values

That is no-every-day code: I like it. Let me recap to see if I get it right. `ascontigous()` returns a contiguous array representation in memory, the `view` method basically tells how to reformat that array with no specified type (`np.void`) and I assume `a.dtype.itemsize * a.shape[1]` is the byte size of one row. I don't get why the transpose when comparing though. — roschach, Nov 18 '20 at 11:06
At some point `view` started keeping dimensions, so the output of that snippet for `(n, m)` input is `(n, 1)` now instead of `(n,)`. So, that means I can trade `[:, None]` and `[None, :]` that's usually needed for broadcasting for a simple transpose. — Daniel F, Nov 18 '20 at 11:12

NNN · Answer 4 · 2020-11-18T10:57:21.403

0

Well, for 2D arrays, something like the following might be useful. I think you'd have to be careful with check ee==0 in floating point arithmetic.

import numpy as np
aa = np.arange(16).reshape(4,4)
# we are trying to find the row in aa which is equal to bb
bb = np.asarray([0,1,2,3])
cc = bb[None,:]
dd = aa - cc
ee = np.linalg.norm(dd,axis=1)
idx = np.where(ee==0)

edited Nov 18 '20 at 10:57

answered Nov 18 '20 at 10:22

NNN

697
1
5
15

This is not robust. if `a = [1,2,3]` and `b=[1,1,4]`, `(a-b).sum()==0` is true, but even though the entries don't match – Daniel F Nov 18 '20 at 10:34
Thanks for pointing that out. Let me see if I can modify it. Edit: replaced np.sum by np.linalg.norm. – NNN Nov 18 '20 at 10:53
1

I mean, it works, but that's a ton of computational overhead even compared to looping. Square roots are sloooooow – Daniel F Nov 18 '20 at 11:15

np.where checking also for subelements in multidimensional arrays

4 Answers4

Test case:

Answer

Get Values

Logic

Linked