7

I want to get the indices of the rows of a main 2D NumPy array A that also appear as rows in another array B.

A=array([[1, 2],
         [3, 4],
         [5, 6],
         [7, 8],
         [9, 10]])

B=array([[1, 4],
         [1, 2],
         [5, 6],
         [6, 3]])

result=[0,2]

This should return [0,2], i.e. the indices in A of the rows that also occur in B.

How can this be done efficiently for 2d arrays?

Thank you!

Edit:

I have tried the function:

k[np.in1d(k.view(dtype='i,i').reshape(k.shape[0]),k2.view(dtype='i,i').reshape(k2.shape[0]))]

from "Implementation of numpy in1d for 2D arrays?", but I get a reshape error. My data are floats (with two decimals). I also tried using sets, but the performance was quite slow.

Yannis Assael
  • 1
    Is there anything you tried yourself that didn't work? – Tim May 22 '14 at 18:37
  • 1
    Yes I tried k[np.in1d(k.view(dtype='i,i').reshape(k.shape[0]),k2.view(dtype='i,i').reshape(k2.shape[0]))] from http://stackoverflow.com/questions/16210738/numpy-in1d-for-2d-arrays. But I get a reshape error. – Yannis Assael May 22 '14 at 18:39
  • 1
    Ah ok, can you edit that in to the question so everyone can see it clearly? – Tim May 22 '14 at 18:40
  • 1
    Why don't you just iterate through array A, keeping track of your index, and then check `A[i] in B`? You could even convert B to a set (the sub lists would need to become tuples) so that the membership check is constant time. – Nacho May 22 '14 at 18:48
  • 1
    I thought this is kind of inefficient. – Yannis Assael May 22 '14 at 18:53
  • 1
    It would just be O(len(A)) operations (using a set for B). Doesn't seem so bad. But maybe there's faster ways out there. – Nacho May 22 '14 at 18:55
  • 1
    It largely depends on the inner structure of your data. Are your arrays sorted in some way? Do they have different lengths? – Hans Then May 22 '14 at 19:03
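The loop-with-a-set idea from the comments above can be sketched as follows. This is only an illustration: the function name `matching_row_indices` is made up here, and the arrays are the ones from the question. It is plain Python, so it may well be slower than a vectorized approach on large inputs.

```python
import numpy as np

def matching_row_indices(A, B):
    """Return the indices of rows of A that also occur as rows of B."""
    # Tuples are hashable, so the rows of B can be stored in a set.
    b_rows = {tuple(row) for row in B.tolist()}
    # One average O(1) membership test per row of A.
    return [i for i, row in enumerate(A.tolist()) if tuple(row) in b_rows]

A = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
B = np.array([[1, 4], [1, 2], [5, 6], [6, 3]])
print(matching_row_indices(A, B))  # [0, 2]
```

Note that rows are compared with exact equality, so float rows only match if their values are bit-for-bit equal after conversion to Python floats.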

2 Answers

5

With minimal changes, you can get your approach to work:

In [15]: A
Out[15]: 
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10]])

In [16]: B
Out[16]: 
array([[1, 4],
       [1, 2],
       [5, 6],
       [6, 3]])

In [17]: np.in1d(A.view('i,i').reshape(-1), B.view('i,i').reshape(-1))
Out[17]: array([ True, False,  True, False, False], dtype=bool)

In [18]: np.nonzero(np.in1d(A.view('i,i').reshape(-1), B.view('i,i').reshape(-1)))
Out[18]: (array([0, 2], dtype=int64),)

In [19]: np.nonzero(np.in1d(A.view('i,i').reshape(-1), B.view('i,i').reshape(-1)))[0]
Out[19]: array([0, 2], dtype=int64)
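A caveat on the snippet above: the string 'i,i' describes a record of two C ints (8 bytes on common platforms), so the view only lines up with the rows when the array has exactly two int32 columns. With float64 data each row is 16 bytes, so the view yields two records per row, which is probably where the reshape error in the question comes from. A minimal sketch that builds the record dtype from the array itself is shown below; the helper names `rows_as_records` and `row_indices_in` are hypothetical, and `np.isin` is used because `np.in1d` is deprecated as of NumPy 2.0.

```python
import numpy as np

def rows_as_records(arr):
    """View each row of a 2D array as a single structured record."""
    arr = np.ascontiguousarray(arr)  # .view() needs contiguous rows
    # One field per column, each with the array's own element dtype.
    record = np.dtype([(f"f{i}", arr.dtype) for i in range(arr.shape[1])])
    return arr.view(record).reshape(-1)

def row_indices_in(A, B):
    """Indices of rows of A that also occur in B (exact equality)."""
    A = np.asarray(A)
    # Assumption: B can be cast to A's dtype and has the same column count.
    B = np.asarray(B, dtype=A.dtype)
    mask = np.isin(rows_as_records(A), rows_as_records(B))
    return np.nonzero(mask)[0]

A = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
B = np.array([[1, 4], [1, 2], [5, 6], [6, 3]])
print(row_indices_in(A, B))                              # [0 2]
print(row_indices_in(A.astype(float), B.astype(float)))  # [0 2]
```

Because the record fields keep the original element dtype, the same code runs on int64 or float64 arrays with any number of columns. Float rows are still matched by exact equality.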

If your arrays are not floats, and are both contiguous, then the following will be faster:

In [21]: dt = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))

In [22]: np.nonzero(np.in1d(A.view(dt).reshape(-1), B.view(dt).reshape(-1)))[0]
Out[22]: array([0, 2], dtype=int64)
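The void view compares raw bytes, which is one reason it is unsuitable for floats: 0.0 and -0.0 compare equal as numbers but have different byte patterns. For the question's data (floats with two decimals), one possible workaround is to convert both arrays to integer hundredths first and compare those. The sketch below does that with a broadcast comparison; the function name `row_indices_in_2dp` and the sample values are made up for illustration, and the scale factor of 100 assumes exactly two decimals.

```python
import numpy as np

def row_indices_in_2dp(A, B):
    """Match rows of floats that are known to two decimal places."""
    # Scale to integer hundredths so tiny float errors cannot break equality.
    a = np.rint(np.asarray(A) * 100).astype(np.int64)
    b = np.rint(np.asarray(B) * 100).astype(np.int64)
    # Compare every row of a with every row of b:
    # intermediate shape is (len(a), len(b), ncols).
    equal = (a[:, None, :] == b[None, :, :]).all(axis=2)
    return np.nonzero(equal.any(axis=1))[0]

A = np.array([[0.10, 0.20], [0.30, 0.40], [1.15, 2.25]])
B = np.array([[0.1 + 0.2, 0.40], [1.15, 2.25], [9.99, 0.01]])
print(row_indices_in_2dp(A, B))  # [1 2]
```

The broadcast comparison needs memory proportional to len(A) * len(B), so for large arrays the integer arrays `a` and `b` could instead be fed to the view-based approach.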

And a quick timing:

In [24]: %timeit np.nonzero(np.in1d(A.view('i,i').reshape(-1), B.view('i,i').reshape(-1)))[0]
10000 loops, best of 3: 75 µs per loop

In [25]: %timeit np.nonzero(np.in1d(A.view(dt).reshape(-1), B.view(dt).reshape(-1)))[0]
10000 loops, best of 3: 29.8 µs per loop
Jaime
  • can you please explain lines 21 and 22? It seems as if you are coercing to some other datatype and setting A as the same format. However, when I try with my own 2D array- call it C- that is shape(25257, 4) and dtype(' – Anna Jan 08 '19 at 21:51
2

You can use np.char.array() objects to do this comparison using np.in1d():

s1 = np.char.array(A[:,0]) + '-' + np.char.array(A[:,1])
s2 = np.char.array(B[:,0]) + '-' + np.char.array(B[:,1])

np.where(np.in1d(s1, s2))[0]
#array([0, 2], dtype=int64)

NOTE: A and B must be of the same data type (int, float, etc) for this to work.
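A rough variant of the same idea that avoids `np.char.array()` and works for any number of columns is sketched below, using `astype(str)`, `np.char.add` and `np.isin`. The helper names `row_keys` and `row_indices_via_strings` are hypothetical. It also illustrates why the data types must match: the integer 1 becomes '1' while the float 1.0 becomes '1.0', so the keys differ.

```python
import numpy as np

def row_keys(arr, sep="-"):
    """Join the values of each row into one string key."""
    cols = np.asarray(arr).astype(str)  # e.g. 1 -> '1', 1.0 -> '1.0'
    keys = cols[:, 0]
    for j in range(1, cols.shape[1]):
        keys = np.char.add(np.char.add(keys, sep), cols[:, j])
    return keys

def row_indices_via_strings(A, B):
    """Indices of rows of A whose string key also appears among B's keys."""
    return np.nonzero(np.isin(row_keys(A), row_keys(B)))[0]

A = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
B = np.array([[1, 4], [1, 2], [5, 6], [6, 3]])
print(row_indices_via_strings(A, B))  # [0 2]
```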

Saullo G. P. Castro