4

Lets say we have two integer numpy array A and B of size (N,M). I would like to check for every i in N, is A[i,:] in B[i,:].

A for-loop implementation is:

for i in range(N):
    C[i] = np.isin(A[i,:],B[i,:])

However this is quite slow for large array. is there any faster way to implement this? (e.g. vectorization?)

Thanks!

Wang Duo
  • 723
  • 6
  • 11

3 Answers3

2

Here's one vectorized approach based on the per-row offsetting as discussed in more details in Vectorized searchsorted numpy's solution -

# https://stackoverflow.com/a/40588862/ @Divakar
def searchsorted2d(a,b):
    m,n = a.shape
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num*np.arange(a.shape[0])[:,None]
    p = np.searchsorted( (a+r).ravel(), (b+r).ravel() ).reshape(m,-1)
    return p - n*(np.arange(m)[:,None])

def numpy_isin2D(A,B):
    sB = np.sort(B,axis=1)
    idx = searchsorted2d(sB,A)
    idx[idx==sB.shape[1]] = 0
    return np.take_along_axis(sB, idx, axis=1) == A

Sample run -

In [351]: A
Out[351]: 
array([[5, 0, 3, 3],
       [7, 3, 5, 2],
       [4, 7, 6, 8],
       [8, 1, 6, 7],
       [7, 8, 1, 5]])

In [352]: B
Out[352]: 
array([[8, 4, 3, 0, 3, 5],
       [0, 2, 3, 8, 1, 3],
       [3, 3, 7, 0, 1, 0],
       [4, 7, 3, 2, 7, 2],
       [0, 0, 4, 5, 5, 6]])

In [353]: numpy_isin2D(A,B)
Out[353]: 
array([[ True,  True,  True,  True],
       [False,  True, False,  True],
       [False,  True, False, False],
       [False, False, False,  True],
       [False, False, False,  True]])
Divakar
  • 218,885
  • 19
  • 262
  • 358
2

The numpy_indexed package (disclaim: I am its author) can be used to obtain a solution similar to Divakars, but with the low-level particulars abstracted away:

import numpy_indexed as npi
Ar = np.indices(A.shape)[0]
Br = np.indices(B.shape)[0]
isin = npi.in_((A.flatten(), Ar.flatten()), (B.flatten(), Br.flatten())).reshape(A.shape)

All functions in the numpy_indexed package operate equally on ndarrays, or in this case, tuples of ndarrays, which in practice equates to 'zipping' the ndarrays in the tuple and acting on that, without incurring the overhead of doing so. So we check for inclusion in a 1d set, of each item zipped with its row index; so matches are only registered when both the row index and the numerical value coincide.

Divakars solution probably has a speed advantage, but both solutions should be of the same time-complexity. And the solution posted here works with arbitrary dtypes; or even if the tokes you are trying to match are ndarrays themselves rather than scalars!

Eelco Hoogendoorn
  • 10,459
  • 1
  • 44
  • 42
0

Maybe something like this:

>A = np.zeros((5, 5))
>B = np.ones((5, 5))
>B[2, :] = 0
>print(A)
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])
>print(B)
array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])
>(A == B).all(axis=1)
array([False, False,  True, False, False])
cyberspace
  • 191
  • 1
  • 2
  • 11
  • How is this supposed to be the same as the code above? If I run OP's code, I get a completely different result... – norok2 Sep 18 '18 at 10:29
  • 1
    @cyberspace Do you understand what `np.isin` does? Can you read the docs? – Divakar Sep 18 '18 at 10:30