Check for identical rows in different NumPy arrays

Question

How do I compare two arrays row by row so that the result is a row-wise True/False array?

Given data:

a = np.array([[1,0],[2,0],[3,1],[4,2]])
b = np.array([[1,0],[2,0],[4,2]])

Result step 1:

c = np.array([True, True,False,True])

Result final:

a = a[c]

So how do I get the array c?

P.S.: In this example the arrays a and b are sorted. Please also say whether your solution requires the arrays to be sorted.

score 24 · Accepted Answer · answered Jul 15 '18 at 22:56

Here's a vectorised solution:

res = (a[:, None] == b).all(-1).any(-1)

print(res)

[ True  True False  True]

Note that a[:, None] == b compares each row of a with every row of b, element-wise. We then use all over the last axis to find row pairs where every element matches, and any to check whether each row of a matches at least one row of b:

print(a[:, None] == b)

[[[ True  True]
  [False  True]
  [False False]]

 [[False  True]
  [ True  True]
  [False False]]

 [[False False]
  [False False]
  [False False]]

 [[False False]
  [False False]
  [ True  True]]]
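
To make the shapes explicit, here is a small sketch of the same expression split into steps. The intermediate names pairwise and row_equal are only illustrative. Nothing here relies on the rows being sorted, but the intermediate array has len(a) * len(b) * n_cols elements, which matters for large inputs.

```python
import numpy as np

a = np.array([[1, 0], [2, 0], [3, 1], [4, 2]])
b = np.array([[1, 0], [2, 0], [4, 2]])

# a[:, None] has shape (4, 1, 2); broadcasting against b (3, 2) gives (4, 3, 2)
pairwise = a[:, None] == b

# Collapse the column axis: True where row i of a equals row j of b
row_equal = pairwise.all(axis=-1)   # shape (4, 3)

# Collapse the b axis: True where row i of a equals some row of b
c = row_equal.any(axis=-1)          # shape (4,)

print(pairwise.shape, row_equal.shape, c.shape)
print(a[c])
```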

jpp

This looks good. With a = np.array([[1,0],[2,0],[4,2],[3,1],[3,0]]) and b = np.array([[1,0],[2,0],[3,1]]), c = (a[:, None] == b).all(-1).any(-1) gives [ True True False True False]. – TomK Jul 16 '18 at 20:36
Similar to here https://stackoverflow.com/questions/53631460/using-numpy-isin-element-wise-between-2d-and-1d-arrays – Joe May 07 '20 at 11:48

Omer Shacham · Answer 2 · 2018-07-16T06:36:29.570

6

You can use NumPy's apply_along_axis, which applies a function to each 1-D slice along the given axis (for a 2-D array, axis=0 passes each column and axis=1 passes each row):

import numpy as np
a = np.array([[1,0],[2,0],[3,1],[4,2]])
b = np.array([[1,0],[2,0],[4,2]])
c = np.apply_along_axis(lambda x,y: x in y, 1, a, b)

edited Jul 16 '18 at 06:36

answered Jul 15 '18 at 22:20

This doesn't actually use np.isin. I'm a bit confused why you mentioned it, as I don't think it's particularly useful here. – user3483203 Jul 15 '18 at 22:45
This does not seem to work for checking identical rows. With a = np.array([[1,0],[2,0],[4,2],[3,1],[3,0]]) and b = np.array([[1,0],[2,0],[3,1]]), c = np.apply_along_axis(lambda x,y: x in y, 1, a, b) gives [ True True False True True], but the last entry should be False. – TomK Jul 16 '18 at 20:34
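
The false positive reported in the comment above follows from how `in` behaves on NumPy arrays: `x in y` is evaluated as `(y == x).any()`, so a single matching element is enough. The sketch below, using the data from that comment, contrasts this with a test that requires every column to match. The names loose and strict are only illustrative, and the adjusted lambda is one possible fix, not the answer author's code.

```python
import numpy as np

a = np.array([[1, 0], [2, 0], [4, 2], [3, 1], [3, 0]])
b = np.array([[1, 0], [2, 0], [3, 1]])

row = a[-1]  # [3, 0], which is not a row of b

# `row in b` is equivalent to (b == row).any(): one equal element suffices
loose = bool((b == row).any())

# A row match needs all columns to agree for at least one row of b
strict = bool((b == row).all(axis=1).any())

# Same idea inside apply_along_axis: x is one row of a, y is all of b
c = np.apply_along_axis(lambda x, y: (x == y).all(axis=1).any(), 1, a, b)

print(loose, strict)
print(c)
```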

Divakar · Answer 3 · 2018-07-16T04:45:58.313

Approach #1

We could use a view based vectorized solution -

# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(),  b.view(void_dt).ravel()

A,B = view1D(a,b)
out = np.isin(A,B)

Sample run -

In [8]: a
Out[8]: 
array([[1, 0],
       [2, 0],
       [3, 1],
       [4, 2]])

In [9]: b
Out[9]: 
array([[1, 0],
       [2, 0],
       [4, 2]])

In [10]: A,B = view1D(a,b)

In [11]: np.isin(A,B)
Out[11]: array([ True,  True, False,  True])

Approach #2

Alternatively, when every row of b is present in a and the rows are lexicographically sorted, we can use the same views with searchsorted -

out = np.zeros(len(A), dtype=bool)
out[np.searchsorted(A,B)] = 1

If the rows are not necessarily lexicographically sorted -

sidx = A.argsort()
out[sidx[np.searchsorted(A,B,sorter=sidx)]] = 1
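
To illustrate the index bookkeeping in Approach #2, here is a minimal sketch that uses plain 1-D integer keys as hypothetical stand-ins for the views A and B. It assumes, as the answer does, that every key in B occurs in A; otherwise searchsorted returns an insertion point and the wrong entry would be marked.

```python
import numpy as np

# Hypothetical 1-D keys standing in for the row views A and B
A = np.array([30, 10, 40, 20])   # not sorted
B = np.array([10, 20, 40])       # every key of B occurs in A

sidx = A.argsort()                                # indices that would sort A
pos_sorted = np.searchsorted(A, B, sorter=sidx)   # positions of B within sorted A

out = np.zeros(len(A), dtype=bool)
out[sidx[pos_sorted]] = True                      # map back to positions in the original A

print(out)
```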

score 2 · Answer 4 · answered Jul 15 '18 at 22:18

You can do it as a list comp via:

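# `row in b` would be True if ANY single element matches, so require a full-row match
c = np.array([(b == row).all(axis=1).any() for row in a])

This approach will be slower than a vectorised NumPy approach, if one is available.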

James

Zev · Answer 5 · 2018-07-15T22:49:40.287

a = np.array([[1,0],[2,0],[3,1],[4,2]])
b = np.array([[1,0],[2,0],[4,2]])

i = 0
j = 0
result = []

We can take advantage of the fact that they are sorted and do this in O(n) time. Using two pointers we just move ahead the pointer that has gotten behind:

while i < len(a) and j < len(b):
    if tuple(a[i])== tuple(b[j]):
        result.append(True)
        i += 1
        j += 1 # get rid of this depending on how you want to handle duplicates
    elif tuple(a[i]) > tuple(b[j]):
        j += 1
    else:
        result.append(False)
        i += 1

Pad with False if it ends early.

if len(result) < len(a):
    result.extend([False] * (len(a) - len(result)))

print(result) # [True, True, False, True]

This answer is adapted from "Better way to find matches in two sorted lists than using for loops?" (Java).
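
The two-pointer idea above can also be wrapped in a function. The sketch below is one possible variant, not the answer's exact code: the name sorted_row_membership is hypothetical, it assumes the rows of both arrays are sorted lexicographically, and it keeps j in place on a match so that duplicate rows in a are all marked True.

```python
import numpy as np

def sorted_row_membership(a, b):
    """Mark rows of a that occur in b; both must be lexicographically sorted."""
    result = np.zeros(len(a), dtype=bool)   # rows never reached stay False
    i = j = 0
    while i < len(a) and j < len(b):
        row_a = tuple(a[i].tolist())
        row_b = tuple(b[j].tolist())
        if row_a == row_b:
            result[i] = True
            i += 1        # keep j: the next row of a may be a duplicate
        elif row_a > row_b:
            j += 1        # b is behind, advance it
        else:
            i += 1        # a[i] cannot appear later in b
    return result

a = np.array([[1, 0], [2, 0], [2, 0], [3, 1], [4, 2]])
b = np.array([[1, 0], [2, 0], [4, 2]])
print(sorted_row_membership(a, b))
```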

score 1 · Answer 6 · answered Apr 23 '20 at 15:57

You can use scipy's cdist which has a few advantages:

import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[1,0],[2,0],[3,1],[4,2]])
b = np.array([[1,0],[2,0],[4,2]])

c = cdist(a, b)==0
print(c.any(axis=1))

[ True  True False  True]

print(a[c.any(axis=1)])

[[1 0]
 [2 0]
 [4 2]]

cdist also accepts a callable as the metric, so you can supply your own function to do whatever comparison you need:

c = cdist(a, b, lambda u, v: (u==v).all())
print(c)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 0.]
 [0. 0. 1.]]

You can now find which indices match, which also shows whether there are multiple matches.

# Array with multiple instances
a2 = np.array([[1,0],[2,0],[3,1],[4,2],[3,1],[4,2]])

c2 = cdist(a2, b, lambda u, v: (u==v).all())
print(c2)

idx = np.where(c2==1)
print(idx)

print(idx[0][idx[1]==2])

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 0.]
 [0. 0. 1.]
 [0. 0. 0.]
 [0. 0. 1.]]
(array([0, 1, 3, 5], dtype=int64), array([0, 1, 2, 2], dtype=int64))
[3 5]

Interesting approach. I wonder how it fares in terms of performance. – Sterling Sep 02 '21 at 00:50

score 1 · Answer 7 · answered Nov 24 '21 at 11:21

The accepted answer is good, but it builds an intermediate boolean array of shape (len(a), len(b), n_cols), so it will struggle with arrays that have a large number of rows. An alternative is:

baseval = np.max([a.max(), b.max()]) + 1
# Encode each row as one integer without modifying a or b in place
a_keys = a[:, 0] + a[:, 1] * baseval
b_keys = b[:, 0] + b[:, 1] * baseval
c = np.isin(a_keys, b_keys)

This uses the maximum value contained in either array plus 1 as a numeric base and treats the columns as baseval^0 and baseval^1 digits. Assuming all values are non-negative integers, this ensures that the weighted sum of the columns is unique for each possible pair of values. If the order of the columns is not important, both input arrays can be sorted within each row using np.sort(a, axis=1) beforehand.

This can be extended to arrays with more columns using:

baseval = np.max([a.max(), b.max()]) + 1
n_cols = a.shape[1]
weights = baseval ** np.arange(n_cols)  # baseval^0, baseval^1, ...
# Keep a and b intact so that a[c] still indexes the original rows
c = np.isin(np.sum(a * weights, axis=1), np.sum(b * weights, axis=1))

The largest possible key is baseval ** n_cols - 1, so overflow can occur when that exceeds 9223372036854775807 (the int64 maximum). This can be avoided by creating the NumPy arrays with dtype=object so that they hold Python integers.
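
As a sketch of how the encoding could be packaged with its assumptions checked up front, the hypothetical helper below (rows_in_by_encoding is not part of NumPy) rejects negative values and raises instead of silently overflowing int64. It leaves the inputs unmodified.

```python
import numpy as np

def rows_in_by_encoding(a, b):
    """Row membership via positional encoding; assumes non-negative integers."""
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    if min(a.min(), b.min()) < 0:
        raise ValueError("encoding is only unique for non-negative values")

    base = int(max(a.max(), b.max())) + 1
    n_cols = a.shape[1]
    # Largest possible key is base**n_cols - 1 (computed with Python ints)
    if base ** n_cols - 1 > np.iinfo(np.int64).max:
        raise OverflowError("keys would not fit in int64")

    weights = base ** np.arange(n_cols, dtype=np.int64)  # base^0, base^1, ...
    return np.isin(a @ weights, b @ weights)

a = np.array([[1, 0], [2, 0], [4, 2], [3, 1], [3, 0]])
b = np.array([[1, 0], [2, 0], [3, 1]])
c = rows_in_by_encoding(a, b)
print(c)
print(a[c])
```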
