How do you Check if each Row of a Numpy Array is Contained in a Secondary Array?

Question

My question is similar to "testing whether a NumPy array contains a given row", but I need a non-trivial extension of the method offered in the linked question. That question asks how to check whether each row in an array is the same as a single other row. The point of this question is to do that for numerous rows at once, and one does not obviously follow from the other.

Say I have an array:

array = np.array([[1, 2, 4], [3, 5, 1], [5, 5, 1], [1, 2, 1]])

I want to know if each row of this array is in a secondary array given by:

check_array = np.array([[1, 2, 4], [1, 2, 1]])

Ideally this would look something like this:

is_in_check = array in check_array

Where is_in_check looks like this:

is_in_check = np.array([True, False, False, True])

I realise for very small arrays it would be easier to use a list comprehension or something similar, but the process has to be performant with arrays on the order of 10⁶ rows.

I have seen that for checking for a single row the correct method is:

is_in_check_single = any((array[:]==[1, 2, 1]).all(1))

But ideally I'd like to generalise this over multiple rows so that the process is vectorized.

In practice, I would expect to see the following dimensions for each array:

array.shape = (1000000, 3)
check_array.shape = (5, 3)

Can you provide dimensions you expect to see in practice? e.g. `array.shape`, `check_array.shape`. It would also help to know the number of unique values that can appear in the arrays (e.g. `1, 2, 3, 4, 5 -> 5` in this example). — hilberts_drinking_problem, May 19 '21 at 11:27
Apologies, I think I made that confusing by using "indexes" instead of "rows" when describing how long it might be. I've fixed that, and given the expected shapes at the bottom. The algorithm is a symmetry-finding algorithm based on distance, so I would imagine that there would only be 50-100 unique rows in a 1,000,000-row array. — Connor, May 19 '21 at 11:38

score 6 · Accepted Answer · answered May 19 '21 at 11:28

Broadcasting is an option:

import numpy as np

array = np.array([[1, 2, 4], [3, 5, 1], [5, 5, 1], [1, 2, 1]])

check_array = np.array([[1, 2, 4], [1, 2, 1]])

is_in_check = (check_array[:, None] == array).all(axis=2).any(axis=0)

Produces:

[ True False False  True]
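
To see why the one-liner above works, it may help to split it into its intermediate steps and look at the shapes. This is only an illustrative breakdown of the same expression; the names `elementwise` and `row_matches` are introduced here for explanation and are not part of the original answer.

```python
import numpy as np

array = np.array([[1, 2, 4], [3, 5, 1], [5, 5, 1], [1, 2, 1]])
check_array = np.array([[1, 2, 4], [1, 2, 1]])

# check_array[:, None] has shape (2, 1, 3). Comparing it with array,
# shape (4, 3), broadcasts to an element-wise result of shape (2, 4, 3).
elementwise = check_array[:, None] == array

# Two rows are equal only if all of their columns are equal: shape (2, 4).
row_matches = elementwise.all(axis=2)

# A row of array is "in" check_array if it equals any check row: shape (4,).
is_in_check = row_matches.any(axis=0)

print(elementwise.shape, row_matches.shape, is_in_check.shape)
print(is_in_check)
```

Note that the intermediate boolean array holds one entry per (check row, array row, column) triple, so its size grows with the product of both row counts.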

Broadcasting the other way:

is_in_check = (check_array == array[:, None]).all(axis=2).any(axis=1)

Also produces:

[ True False False  True]
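
Since the question expects `check_array` to have only a handful of rows (shape `(5, 3)`), another possible approach is to loop over those few rows and OR together the single-row test from the question. This is a sketch under that assumption, not a benchmarked recommendation; `rows_in` is a hypothetical helper name. It avoids building the 3-D intermediate array, at the cost of one Python-level iteration per check row.

```python
import numpy as np


def rows_in(array, check_array):
    """Hypothetical helper: flag each row of `array` that appears in `check_array`."""
    result = np.zeros(len(array), dtype=bool)
    for row in check_array:
        # (array == row) has shape (n, 3); a row matches if all columns match.
        result |= (array == row).all(axis=1)
    return result


array = np.array([[1, 2, 4], [3, 5, 1], [5, 5, 1], [1, 2, 1]])
check_array = np.array([[1, 2, 4], [1, 2, 1]])

is_in_check = rows_in(array, check_array)
print(is_in_check)
```

Whether this is faster or slower than broadcasting for a given pair of shapes is worth measuring on the real data.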
