3

I have two multidimensional arrays with the same second dimension. I want to be sure no element (i.e., no row) of the first array is also a row of the second array.

To do this I am using numpy.where, but its behaviour is also checking for sub-elements in the same position. For example consider this code:

x = np.array([[0,1,2,3], [4,0,6,9]])
z= np.array([[0,1,2,3], [5, 11, 6,98]])
for el in x:
    print(np.where(z==el))

It prints:

(array([0, 0, 0, 0]), array([0, 1, 2, 3]))
(array([1]), array([2]))

where the first result is due to the first arrays being equal, the second is because the both z[1] and x[1] have 6 as third element. Is there a way to tell np.where to return only indexes of strictly equal elements, i.e. 0 in the example above?

Mr. T
  • 11,960
  • 10
  • 32
  • 54
roschach
  • 8,390
  • 14
  • 74
  • 124

4 Answers4

2
[i for i, e in enumerate(x) if (e == z).all(1).any()]

Test case:

x = np.array([[0,1,2,3], [4,0,6,9], [4,0,6,19]])
z= np.array([[4,0,6,9], [0,1,2,3]])

[i for i, e in enumerate(x) if (e == z).all(1).any()]

Output:

[0, 1]
mujjiga
  • 16,186
  • 2
  • 33
  • 51
  • Using vectorized approach will speed this up dramatically - see my answer – orgoro Nov 18 '20 at 10:50
  • Actually in my test your answer is only about a 10% speedup over this. And my void view method is about 60x faster than either. – Daniel F Nov 18 '20 at 11:01
  • This is a good answer and I like the simplicity, getting also the indexes of the first vector would be even better. – roschach Nov 18 '20 at 11:10
2

Where simply returns the indices of your condition - here it's element wise equal

Answer

You can find the duplicates using vectorized operations:

duplicates = (x[:, None] == z).all(-1).any(-1)

Get Values

To get the duplicates values use masking

x[duplicates]

in this example:

duplicates = [True False]

x[duplicates] = [[0, 1, 2, 3]]

Logic

  1. expanding the array [:, None]
  2. find only full row matches all(-1)
  3. return rows that have at least one match any(-1)
orgoro
  • 186
  • 2
  • 7
1

Man I haven't had a chance to link to this answer since np.unique added an axis parameter. Credit to @Jaime

vview = lambda a: np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))

Basically, that takes the "rows" of your matrix and turns them into a 1-d array of views on the raw datastream of the rows. This lets you compare rows as if they were single values.

Then it's fairly simple:

print(np.where(vview(x) == vview(z).T))
(array([0], dtype=int64), array([0], dtype=int64))

Representing that the 1st row of x matches the first row of z

If you only want to know if rows of x are in rows of z:

print(np.where(np.isin(vview(x), vview(z)).squeeze()))
(array([0], dtype=int64),)

Checking times compared to @mujjiga on big arrays:

x = np.random.randint(10, size = (1000, 4))

z = np.random.randint(10, size = (1000, 4))

%timeit np.where(np.isin(vview(x), vview(z)).squeeze())
365 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit [i for i, e in enumerate(x) if (e == z).all(1).any()]  # @mujjiga
21.3 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.where((x[:, None] == z).all(-1).any(-1))  # @orgoro
20 ms ± 767 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So about a 60x speedup over looping and slicing, probably due to quick short-circuiting and only comparing 1/4 the values

Daniel F
  • 13,620
  • 2
  • 29
  • 55
  • That is no-every-day code: I like it. Let me recap to see if I get it right. `ascontigous()` returns a contiguous array representation in memory, the `view` method basically tells how to reformat that array with no specified type (`np.void`) and I assume `a.dtype.itemsize * a.shape[1]` is the byte size of one row. I don't get why the transpose when comparing though. – roschach Nov 18 '20 at 11:06
  • At some point `view` started keeping dimensions, so the output of that snippet for `(n, m)` input is `(n, 1)` now instead of `(n,)`. So, that means I can trade `[:, None]` and `[None, :]` that's usually needed for broadcasting for a simple transpose. – Daniel F Nov 18 '20 at 11:12
0

Well, for 2D arrays, something like the following might be useful. I think you'd have to be careful with check ee==0 in floating point arithmetic.

import numpy as np
aa = np.arange(16).reshape(4,4)
# we are trying to find the row in aa which is equal to bb
bb = np.asarray([0,1,2,3])
cc = bb[None,:]
dd = aa - cc
ee = np.linalg.norm(dd,axis=1)
idx = np.where(ee==0)
NNN
  • 697
  • 1
  • 5
  • 15
  • This is not robust. if `a = [1,2,3]` and `b=[1,1,4]`, `(a-b).sum()==0` is true, but even though the entries don't match – Daniel F Nov 18 '20 at 10:34
  • Thanks for pointing that out. Let me see if I can modify it. Edit: replaced np.sum by np.linalg.norm. – NNN Nov 18 '20 at 10:53
  • 1
    I mean, it works, but that's a ton of computational overhead even compared to looping. Square roots are sloooooow – Daniel F Nov 18 '20 at 11:15