How can I find the intersection of two multidimensional arrays faster?

Question

I have two 2-D boolean arrays with different numbers of rows. I want to quickly find the indexes of the True values in the rows that appear in both arrays. I wrote the following code, but it is too slow. Is there a faster way to do this?

import numpy as np

a = np.random.choice(a=[False, True], size=(100, 100))
b = np.random.choice(a=[False, True], size=(1000, 100))

# Compare every row of a with every row of b
for i in a:
    for j in b:
        if np.array_equal(i, j):
            print(np.where(i))

Gulzar · Accepted Answer · 2021-06-24T16:35:10.623

Let's start with an edited version of the question's code, using arrays small enough that it usually finds a match, and collecting the indices of the matching rows:

import numpy as np

a = np.random.choice(a=[False, True], size=(2, 2))
b = np.random.choice(a=[False, True], size=(4, 2))

print(f"a: \n {a}")
print(f"b: \n {b}")

matches = []
for i, x in enumerate(a):
    for j, y in enumerate(b):
        if np.array_equal(x, y):
            matches.append((i, j))

And here is a solution using scipy.spatial.distance.cdist, which compares every row of a against every row of b. With the Hamming metric, the distance between two boolean rows is zero exactly when the rows are identical:

import numpy as np
from scipy.spatial.distance import cdist

# d[i, j] is the fraction of positions where a[i] and b[j] differ
d = cdist(a, b, metric='hamming')
# Row pairs with distance 0 are identical
cdist_matches = np.where(d == 0)
matches_values = [(a[i], b[j]) for (i, j) in matches]
cdist_values = a[cdist_matches[0]], b[cdist_matches[1]]
print(f"matches_inds = \n{matches}")
print(f"matches = \n{matches_values}")

print(f"cdist_inds = \n{cdist_matches}")
print(f"cdist_matches =\n {cdist_values}")

out:

a: 
 [[ True False]
 [False False]]
b: 
 [[ True  True]
 [ True False]
 [False False]
 [False  True]]
matches_inds = 
[(0, 1), (1, 2)]
matches = 
[(array([ True, False]), array([ True, False])), (array([False, False]), array([False, False]))]
cdist_inds = 
(array([0, 1], dtype=int64), array([1, 2], dtype=int64))
cdist_matches =
 (array([[ True, False],
       [False, False]]), array([[ True, False],
       [False, False]]))
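
To see why the `d == 0` test picks out identical rows, here is a small sketch that reuses the `a` and `b` printed above as fixed arrays (hard-coded here only for reproducibility). The Hamming distance is the fraction of positions in which two rows differ, so it is zero only when no position differs.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Fixed copies of the arrays shown in the output above
a = np.array([[True, False],
              [False, False]])
b = np.array([[True, True],
              [True, False],
              [False, False],
              [False, True]])

# d[i, j] = fraction of positions where a[i] and b[j] differ
d = cdist(a, b, metric="hamming")
print(d.tolist())  # [[0.5, 0.0, 0.5, 1.0], [1.0, 0.5, 0.0, 0.5]]

# A distance of exactly 0 means the two rows agree everywhere
rows_a, rows_b = np.nonzero(d == 0)
pairs = list(zip(rows_a.tolist(), rows_b.tolist()))
print(pairs)  # [(0, 1), (1, 2)]
```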

See this for a pure NumPy implementation if you don't want to import SciPy.
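
As one possible SciPy-free alternative (a sketch, and not necessarily what the linked implementation does), each row can be keyed by its raw bytes in a dictionary, so that every row of `a` is looked up once instead of being compared with every row of `b`. The function name `matching_row_pairs` is hypothetical, and the sketch assumes both arrays are boolean with the same number of columns.

```python
from collections import defaultdict

import numpy as np


def matching_row_pairs(a, b):
    """Return (i, j) pairs such that row i of a equals row j of b."""
    a = np.ascontiguousarray(a, dtype=bool)
    b = np.ascontiguousarray(b, dtype=bool)

    # Map the raw bytes of each row of b to the indices where that row occurs
    rows_of_b = defaultdict(list)
    for j, row in enumerate(b):
        rows_of_b[row.tobytes()].append(j)

    # Identical boolean rows have identical bytes, so a lookup finds all matches
    return [(i, j)
            for i, row in enumerate(a)
            for j in rows_of_b.get(row.tobytes(), [])]


a = np.array([[True, False],
              [False, False]])
b = np.array([[True, True],
              [True, False],
              [False, False],
              [False, True]])

pairs = matching_row_pairs(a, b)
print(pairs)  # [(0, 1), (1, 2)]

# What the question asked for: column indexes of True values in each common row
true_columns = [np.flatnonzero(a[i]).tolist() for i, _ in pairs]
print(true_columns)  # [[0], []]
```

The loops here run in Python, but the work grows with the number of rows in `a` plus the number of rows in `b`, rather than with their product.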

@Gulzar I added a solution below, making a broadcastable to b and comparing each row of a to each row of b, if I have understood the question correctly — Tom McLean, Jun 24 '21 at 21:10

score 0 · Answer 2 · answered Jun 24 '21 at 21:07

Each row of a can be compared with each row of b by making the shape of a broadcastable to the shape of b, using np.newaxis and np.tile:

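import numpy as np

a = np.random.choice(a=[True, False], size=(2, 5))
b = np.random.choice(a=[True, False], size=(10, 5))
# Repeat each row of a once per row of b: shape (2, 10, 5)
broadcastable_a = np.tile(a[:, np.newaxis, :], (1, b.shape[0], 1))
# Element-wise comparison of every row of a with every row of b: shape (2, 10, 5)
a_equal_b = np.equal(b, broadcastable_a)
# Positions (row of a, row of b, column) where the elements are equal
indexes = np.where(a_equal_b)
# Keep only the (row of b, column) pairs
indexes = np.stack(np.array(indexes[1:]), axis=1)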

Tom McLean

I think it won't work because it only compares `b` as blocks and not by row. Maybe I didn't understand correctly. Please also add the code to convert back from the result of `.where` to the required indices. Also please show output. – Gulzar Jun 25 '21 at 10:58
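
To illustrate the point raised in this comment, here is a minimal sketch of how the broadcasting idea could be completed so that it matches whole rows. It is an assumption about what the answer intends, not the answer's own code: the element-wise comparison is reduced over the column axis with `.all`, and `np.argwhere` converts the result into index pairs. The arrays are hard-coded for reproducibility.

```python
import numpy as np

a = np.array([[True, False],
              [False, False]])
b = np.array([[True, True],
              [True, False],
              [False, False],
              [False, True]])

# a[:, np.newaxis, :] has shape (2, 1, 2); comparing it with b of shape (4, 2)
# broadcasts to (2, 4, 2), so np.tile is not needed
elementwise = a[:, np.newaxis, :] == b

# Two rows match only if every column matches: reduce over the last axis
row_equal = elementwise.all(axis=-1)  # shape (2, 4)

# Each entry is [i, j], meaning row i of a equals row j of b
pairs = np.argwhere(row_equal)
print(pairs.tolist())  # [[0, 1], [1, 2]]

# Column indexes of the True values in each matching row
true_columns = [np.flatnonzero(a[i]).tolist() for i, _ in pairs]
print(true_columns)  # [[0], []]
```

The intermediate array holds one boolean per (row of a, row of b, column) triple, which for the question's 100 x 100 and 1000 x 100 inputs is 10 million booleans, so memory use should be considered for much larger arrays.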
