2

I have a large numpy array

X= np.random.rand(1000,1000)

then I have a smaller numpy array, which in this example could be generated by

Y =X [[3,7,921],:]

I would like to check programmatically whether all the rows of Y are in X.

The solution should scale up well. Possibly without requiring additional dependencies.

Inspired by: Get intersecting rows across two 2D numpy arrays

So far I have tried:

np.all([s in set([tuple(X) for x in X]) for s in set([tuple(y) for y in Y])])

but my logic must be flawed since I still get False when it should be True

Divakar
  • 218,885
  • 19
  • 262
  • 358
00__00__00
  • 4,834
  • 9
  • 41
  • 89

1 Answers1

2

Here's one approach with views -

# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(),  b.view(void_dt).ravel()

X1D, Y1D = view1D(X,Y)
out = np.in1d(X1D, Y1D).sum() == len(Y1D)

Sample run -

In [66]: X = np.random.rand(1000,1000)

In [67]: Y = X [[3,7,921],:]

In [68]: X1D, Y1D = view1D(X,Y)

In [69]: np.in1d(X1D, Y1D).sum() == len(Y1D)
Out[69]: True

In [70]: Y[2] = 1

In [71]: X1D, Y1D = view1D(X,Y)

In [72]: np.in1d(X1D, Y1D).sum() == len(Y1D)
Out[72]: False

Benchmarking and alternatives

The proposed way seems pretty fast -

In [73]: %timeit np.in1d(X1D, Y1D).sum() == len(Y1D)
10000 loops, best of 3: 39.6 µs per loop

We can also do - np.in1d(Y1D, X1D).all(), but I think this would iterate through the elements of X1D for a match in Y1D. Now, in our case, it seems X1D i.e. X is a larger array, so this would be computationally more heavy than using np.in1d(X1D, Y1D) as with the earlier proposed one -

In [84]: %timeit np.in1d(Y1D, X1D)
100 loops, best of 3: 5.82 ms per loop

In [85]: %timeit np.in1d(X1D, Y1D)
10000 loops, best of 3: 34.1 µs per loop

Hence, the alternative solution would be slower -

In [79]: %timeit np.in1d(Y1D, X1D).all()
100 loops, best of 3: 5.99 ms per loop
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • This relies on a direct memory comparison, so it could produce unexpected results for -0.0 or NaN. Also, I think you switched the arguments to `in1d` in your first code snippet. – user2357112 Jan 18 '18 at 18:35
  • @user2357112 Yup, edited. For those corner cases, I would let OP get back to us, if there's such in their cases. – Divakar Jan 18 '18 at 18:39
  • while you answer is correct, it would be really nice to know in which cases it does work... mayeb you could still edit the answer?thanks – 00__00__00 Jan 18 '18 at 18:48
  • @ErroriSalvo Not sure I follow - `it would be really nice to know in which cases it does work`? – Divakar Jan 18 '18 at 20:06
  • @user2357112 said it could produce unexpected results for -0.0 or NaN. That is a limitation...any other value for which it could produce unexpected results? – 00__00__00 Jan 19 '18 at 08:10
  • 1
    @ErroriSalvo Well `np.nan == np.nan` yields False. So, if you want to handle NaNs, you need to tell us the desired results. This solution should be fine for negative numbers too. Can't think of anything else. – Divakar Jan 19 '18 at 08:15
  • thanks! I have added a minimal input validation to check for nans. Nice and fast solution anyway – 00__00__00 Jan 19 '18 at 08:16