15

I have two 2D arrays of the same size

a = array([[1,2],[3,4],[5,6]])
b = array([[1,2],[3,4],[7,8]])

I want to know the rows of b that are in a.

So the output should be :

array([ True,  True, False], dtype=bool)

without doing:

array([any(i == a) for i in b])

because a and b are huge.

There is a function that does this, but only for 1D arrays: np.in1d.
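To make the limitation concrete, here is a minimal sketch of 1D membership testing. It uses `np.isin`, the newer spelling of `np.in1d`, which is an assumption about the NumPy version in use (1.13 or later). The `[[1, 4]]` probe row is made up for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6])

# 1-D membership: which elements of x occur anywhere in y?
mask_1d = np.isin(x, y)

a = np.array([[1, 2], [3, 4], [5, 6]])

# On 2-D input the test is still element-wise, not row-wise:
# 1 and 4 both occur somewhere in a, but [1, 4] is not a row of a.
elementwise = np.isin(np.array([[1, 4]]), a)

print(mask_1d)
print(elementwise)
```

So a row-wise test needs each row to be turned into a single comparable value first.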

amine23

4 Answers

15

What we'd really like to do is use np.in1d, except that np.in1d only works with 1-dimensional arrays and ours are multi-dimensional. However, we can view each row as a single fixed-size block of bytes (dtype np.void), which turns the array into what is effectively a 1-dimensional array of byte strings:

arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

For example,

In [15]: arr = np.array([[1, 2], [2, 3], [1, 3]])

In [16]: arr = arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

In [30]: arr.dtype
Out[30]: dtype('V16')

In [31]: arr.shape
Out[31]: (3, 1)

In [37]: arr
Out[37]: 
array([[b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00'],
       [b'\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
       [b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00']],
      dtype='|V16')

This makes each row of arr a single byte string. Now it is just a matter of hooking this up to np.in1d:

import numpy as np

def asvoid(arr):
    """
    Based on http://stackoverflow.com/a/16973510/190597 (Jaime, 2013-06)
    View the array as dtype np.void (bytes). The items along the last axis are
    viewed as one value. This allows comparisons to be performed on the entire row.
    """
    arr = np.ascontiguousarray(arr)
    if np.issubdtype(arr.dtype, np.floating):
        # Care needs to be taken here since
        # np.array([-0.]).view(np.void) != np.array([0.]).view(np.void).
        # Adding 0. converts -0. to 0.; build a new array so that the
        # caller's input is not modified in place.
        arr = arr + 0.
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))


def inNd(a, b, assume_unique=False):
    """Return a boolean mask that is True for each row of a that occurs in b."""
    a = asvoid(a)
    b = asvoid(b)
    return np.in1d(a, b, assume_unique)


tests = [
    (np.array([[1, 2], [2, 3], [1, 3]]),
     np.array([[2, 2], [3, 3], [4, 4]]),
     np.array([False, False, False])),
    (np.array([[1, 2], [2, 2], [1, 3]]),
     np.array([[2, 2], [3, 3], [4, 4]]),
     np.array([True, False, False])),
    (np.array([[1, 2], [3, 4], [5, 6]]),
     np.array([[1, 2], [3, 4], [7, 8]]),
     np.array([True, True, False])),
    (np.array([[1, 2], [5, 6], [3, 4]]),
     np.array([[1, 2], [5, 6], [7, 8]]),
     np.array([True, True, False])),
    (np.array([[-0.5, 2.5, -2, 100, 2], [5, 6, 7, 8, 9], [3, 4, 5, 6, 7]]),
     np.array([[1.0, 2, 3, 4, 5], [5, 6, 7, 8, 9], [-0.5, 2.5, -2, 100, 2]]),
     np.array([False, True, True]))
]

for a, b, answer in tests:
    result = inNd(b, a)
    try:
        assert np.all(answer == result)
    except AssertionError:
        print('''\
a:
{a}
b:
{b}

answer: {answer}
result: {result}'''.format(**locals()))
        raise
else:
    print('Success!')

yields

Success!
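As an alternative that does not rely on byte views, the rows can be labelled with integers and the labels compared. This is only a sketch, not part of the answer above: `rows_in` is a hypothetical helper name, and it assumes a NumPy version where `np.unique` accepts `axis` (1.13 or later). It sorts the stacked rows, so it may be slower than the view approach on huge inputs.

```python
import numpy as np

def rows_in(b, a):
    """Return a boolean mask: True where a row of b also occurs as a row of a."""
    a = np.asarray(a)
    b = np.asarray(b)
    stacked = np.concatenate([a, b], axis=0)
    # Give every distinct row an integer label. ravel() guards against
    # NumPy versions that return the inverse with an extra dimension.
    _, labels = np.unique(stacked, axis=0, return_inverse=True)
    labels = labels.ravel()
    a_labels, b_labels = labels[:len(a)], labels[len(a):]
    # Plain 1-D membership test on the integer labels.
    return np.isin(b_labels, a_labels)

a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[1, 2], [3, 4], [7, 8]])
mask = rows_in(b, a)
print(mask)
```

Because it works on labels rather than raw bytes, the number of columns and the dtype do not matter.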
unutbu
  • View it as a record array, I think `.view(dtype([('', a.dtype)]*a.shape[1]))` is what you need, and you have the same trick working for any type. – Jaime Apr 25 '13 at 14:08
  • @Jaime: I tried `a1d = a.view([('f0','int32'),('f1','int32')])`, `b1d = ...`, `np.in1d(a1d, b1d)` but got a TypeError. If you see a way around this, I'd love to know. – unutbu Apr 25 '13 at 14:13
  • This also strangely fails the test case I posted as a comment on @Jan's answer. – amine23 Apr 25 '13 at 14:38
  • But I don't understand why it fails on that case. Also, should this work if I have more than two columns? – amine23 Apr 25 '13 at 15:11
  • @amine23: It seems I had the order of `a` and `b` backwards in the call to `in1d`. I think it is correct now. If you have more than two columns, it could still work, **but only if** there is a basic dtype (such as `float64`) which is exactly the same size as the total size of one row (such as 4 columns of `float16`). – unutbu Apr 25 '13 at 18:25
  • Funny that mergesort doesn't work with generalized dtypes... While it still won't work with this, the simplest way I found to join many fields in a single dtype is `dtype((np.void, a.dtype.itemsize*a.shape[1]))`. – Jaime Apr 25 '13 at 18:34
  • @Jaime: Thanks, I hadn't seen that before. – unutbu Apr 25 '13 at 19:03
  • @unutbu thanks! You did a great job, but unfortunately it won't work in my case because I have 5 columns :( – amine23 Apr 25 '13 at 22:06
  • @amine23: Okay, in that case, I can not think of a better way than Jan's original method (the one using `np.in1d` once for each column.) I've added an expression which generalizes Jan's method to an arbitrary number of columns. – unutbu Apr 25 '13 at 22:17
  • @unutbu Here is a sample row : `array([-0.5,2.5,-2,100,2])` what type should this be viewed as ? – amine23 Apr 25 '13 at 22:23
  • @amine23: Unfortunately, I don't think there is a `dtype` which will work in this case. – unutbu Apr 25 '13 at 22:26
  • @unutbu Jan's original method failed the test case posted in the first comment. – amine23 Apr 25 '13 at 22:28
  • @unutbu problematic case :\ `a = np.array([[1,2],[2,3],[1,3]])` `b = np.array([[2,2],[3,3],[4,4]])` – amine23 Apr 25 '13 at 22:54
  • Very nice solution with the conversion to strings! However, I've noticed lately that the view in the solution above only works with Python2 and not Python3. If you use Python 3, you may get the following error: ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array. Here's a code snippet: https://pastebin.com/XU3kVBP9. Interestingly, it works if you change the dtype from float32 to int32. – 0vbb Jan 09 '19 at 14:26
  • @0vbb: In Python2, `np.str` was a dtype representing bytes. In Python3, `np.str` represents unicode strings. Here, we want to compare values as bytes not unicode strings. `np.void` serves this purpose on both Python2 and Python3. – unutbu Jan 09 '19 at 14:50
  • @0vbb: I've updated the code above with [Jaime's idea](https://stackoverflow.com/questions/16216078/test-for-membership-in-a-2d-numpy-array/16216866#comment23201476_16216866) of using `np.void` dtype instead of `np.str`. This avoids the `ValueError` you were seeing too. – unutbu Jan 09 '19 at 15:00
4
In [1]: import numpy as np

In [2]: a = np.array([[1,2],[3,4]])

In [3]: b = np.array([[3,4],[1,2]])

In [5]: a = a[a[:,1].argsort(kind='mergesort')]

In [6]: a = a[a[:,0].argsort(kind='mergesort')]

In [7]: b = b[b[:,1].argsort(kind='mergesort')]

In [8]: b = b[b[:,0].argsort(kind='mergesort')]

In [9]: bInA1 = b[:,0] == a[:,0]

In [10]: bInA2 = b[:,1] == a[:,1]

In [11]: bInA = bInA1*bInA2

In [12]: bInA
Out[12]: array([ True,  True], dtype=bool)

This should do it, although I am not sure whether it is still efficient. You need to use mergesort, because the other sort kinds are not stable.
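The two stable sorting passes amount to a lexicographic sort of the rows. As a sketch (not part of the original answer, and with a made-up sample array), the same ordering can be obtained in one call with `np.lexsort`, which takes its keys in reverse priority order:

```python
import numpy as np

a = np.array([[3, 4], [1, 9], [1, 2], [3, 0]])

# Two stable passes: secondary key (column 1) first, then primary key (column 0).
two_pass = a[a[:, 1].argsort(kind='mergesort')]
two_pass = two_pass[two_pass[:, 0].argsort(kind='mergesort')]

# np.lexsort sorts by the LAST key first, so column 0 is the primary key here.
one_pass = a[np.lexsort((a[:, 1], a[:, 0]))]

print(two_pass)
print(one_pass)
```

For more than two columns, the keys would be passed from the least significant column to the most significant one.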

Edit:

If you have more than 2 columns and if the rows are sorted already, you can do

In [24]: bInA = np.array([True,]*a.shape[0])

In [25]: bInA
Out[25]: array([ True,  True], dtype=bool)

In [26]: for k in range(a.shape[1]):
   ....:     bInAk = b[:,k] == a[:,k]
   ....:     bInA = bInAk*bInA
   ....:

In [27]: bInA
Out[27]: array([ True,  True], dtype=bool)

There is still room for a speed-up: in each iteration you don't have to check the entire column, only the entries where the current bInA is still True.
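A minimal sketch of that speed-up, assuming the two arrays have the same shape and are already sorted so that matching rows sit at the same index. `aligned_row_match` is a hypothetical name, and note that this compares rows position by position rather than testing general membership.

```python
import numpy as np

def aligned_row_match(a, b):
    """True where row i of a equals row i of b, checking surviving rows only."""
    alive = np.arange(a.shape[0])          # indices of rows that still match
    for k in range(a.shape[1]):
        # Compare column k only for the rows that matched on all earlier columns.
        alive = alive[a[alive, k] == b[alive, k]]
        if alive.size == 0:
            break
    mask = np.zeros(a.shape[0], dtype=bool)
    mask[alive] = True
    return mask

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[1, 2, 3], [4, 0, 6], [7, 8, 0]])
mask = aligned_row_match(a, b)
print(mask)
```

Whether this beats the plain column loop depends on how quickly rows are eliminated; when most rows match, the fancy indexing adds overhead.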

Jan
3

If you have something like a=np.array([[1,2],[3,4],[5,6]]) and b=np.array([[5,6],[1,2],[7,6]]), you can convert each of them into a complex 1-D array:

c=a[:,0]+a[:,1]*1j
d=b[:,0]+b[:,1]*1j

In my interpreter, the whole thing looks like this:

>>> c=a[:,0]+a[:,1]*1j
>>> c
array([ 1.+2.j,  3.+4.j,  5.+6.j])
>>> d=b[:,0]+b[:,1]*1j
>>> d
array([ 5.+6.j,  1.+2.j,  7.+6.j])

Now that you have just 1D arrays, you can simply call np.in1d(c,d), which tells you which rows of a are in b:

>>> np.in1d(c,d)
array([ True, False,  True], dtype=bool)

With this you don't need any loops, at least for this data type (two real-valued columns).
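Applied to the arrays from the question, the complex-number trick can be sketched as follows. `as_complex` is a hypothetical helper, and `np.isin` is used in place of `np.in1d` on the assumption of NumPy 1.13 or later. The approach is limited to exactly two real-valued columns, and very large integers may lose precision when converted to float.

```python
import numpy as np

def as_complex(arr):
    """Pack each two-column row (x, y) into the single complex value x + y*1j."""
    arr = np.asarray(arr, dtype=float)
    if arr.ndim != 2 or arr.shape[1] != 2:
        raise ValueError("the complex trick needs exactly two columns")
    return arr[:, 0] + 1j * arr[:, 1]

a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[1, 2], [3, 4], [7, 8]])

# Rows of b that are in a: the first argument is the array being tested.
b_in_a = np.isin(as_complex(b), as_complex(a))
print(b_in_a)
```

The argument order decides which array the mask refers to, so swapping the two arguments answers the opposite question.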

Oresto
0

NumPy can compare the two arrays element by element and return True where the entries are the same and False where they are not:

import numpy as np
a = np.array(([1,2],[3,4],[5,6])) #converting to a numpy array
b = np.array(([1,2],[3,4],[7,8])) #converting to a numpy array
new_array = a == b #creating a new boolean array from comparing a and b

Now new_array looks like this:

[[ True  True]
 [ True  True]
 [False False]]

But that is not what you want. So you can transpose the array (swap rows and columns) and then combine the two resulting rows with the & operator. This creates a 1-D array that is True only where both columns of a row are True:

new_array = new_array.T #transposing
result = new_array[0] & new_array[1] #comparing rows

When you print result, you now get what you're looking for:

[ True  True False]
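The comparison above only matches rows that sit at the same position in both arrays. A sketch of a broadcast version that compares every row of b with every row of a is shown below; it is an illustration rather than the method of this answer, the sample arrays are made up, and the intermediate array has shape (len(b), len(a), ncols), so it is only practical when the inputs are not huge.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[3, 4], [1, 2], [1, 4]])

# Position-by-position comparison misses rows that appear in a different order.
aligned = (a == b[:2]).all(axis=1)

# Compare every row of b with every row of a via broadcasting.
pairwise = b[:, None, :] == a[None, :, :]      # shape (len(b), len(a), ncols)
b_in_a = pairwise.all(axis=2).any(axis=1)      # a row matches if all columns agree

print(aligned)
print(b_in_a)
```

Using `.all(axis=...)` instead of `&` on two transposed rows also makes the check independent of the number of columns.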
Ryan Saxe
  • What if `a = array([[1,2],[3,4]])` and `b = array([[3,4],[1,2]])` ? – amine23 Apr 25 '13 at 14:46
  • It was not really clear that you wanted to be able to compare all of them. Your example didn't show that clearly... and you want to be able to check if a nested array in b is in a without using a for loop? – Ryan Saxe Apr 25 '13 at 18:39