Get intersecting rows across two 2D numpy arrays

Question

I want to get the intersecting (common) rows across two 2D numpy arrays. E.g., if the following arrays are passed as inputs:

array([[1, 4],
       [2, 5],
       [3, 6]])

array([[1, 4],
       [3, 6],
       [7, 8]])

the output should be:

array([[1, 4],
       [3, 6])

I know how to do this with loops. I'm looking at a Pythonic/Numpy way to do this.

score 51 · Accepted Answer · answered Nov 29 '11 at 20:37

51

For short arrays, using sets is probably the clearest and most readable way to do it.

Another way is to use numpy.intersect1d. You'll have to trick it into treating the rows as a single value, though... This makes things a bit less readable...

import numpy as np

A = np.array([[1,4],[2,5],[3,6]])
B = np.array([[1,4],[3,6],[7,8]])

nrows, ncols = A.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)],
       'formats':ncols * [A.dtype]}

C = np.intersect1d(A.view(dtype), B.view(dtype))

# This last bit is optional if you're okay with "C" being a structured array...
C = C.view(A.dtype).reshape(-1, ncols)

For large arrays, this should be considerably faster than using sets.

answered Nov 29 '11 at 20:37

Joe Kington

275,208
71
604
463

2

Would `np.intersect1d(a, b).reshape(-1, ncols)` achieve the same result? – Rob Cowie Nov 29 '11 at 20:43
Scratch the earlier comment... You're quite right... It should work in all cases. – Joe Kington Nov 29 '11 at 20:52
3

Actually, no, it won't work. (I realized that earlier, and then forgot it!) Without the structured dtype, it doesn't consider things as rows, just the "raw" numbers. Consider something like `A = np.array([[4,1],[2,5],[3,6]])` and `B = np.array([[1,4],[3,6],[7,8]])`. – Joe Kington Nov 29 '11 at 20:56
I tried to use this, but it doesn't seem to work for me in python 2.6.5. – Karthik Nov 30 '11 at 05:38
1

@Karthik - Do you get a `ValueError: zero length field name in format`? I've used new-style string formatting. On python2.6, you'll need to put `'names':['f{0}'.format....` instead of `'names':['f{}'.format...` – Joe Kington Nov 30 '11 at 13:27
3

You can replace that dtype line with: `dtype = (', '.join([str(A.dtype)]*ncols))`. the name is not specified, so it defaults to f0, f1, etc. – Henry Schreiner Apr 14 '15 at 20:05
I know It is already years ago. I wonder how could you shed a light on me for my similar questions. But my A and B have different length, and also I want to find the corresponding index of A and B, not the common elements themselves. Thanks. – Huanian Zhang Dec 19 '16 at 08:56
The link of my question is: `http://stackoverflow.com/questions/41214903/get-intersect-indexes-across-two-2d-or-3d-numpy-arrays` – Huanian Zhang Dec 19 '16 at 09:11

mtrw · Answer 2 · 2011-11-29T20:50:16.020

23

You could use Python's sets:

>>> import numpy as np
>>> A = np.array([[1,4],[2,5],[3,6]])
>>> B = np.array([[1,4],[3,6],[7,8]])
>>> aset = set([tuple(x) for x in A])
>>> bset = set([tuple(x) for x in B])
>>> np.array([x for x in aset & bset])
array([[1, 4],
       [3, 6]])

As Rob Cowie points out, this can be done more concisely as

np.array([x for x in set(tuple(x) for x in A) & set(tuple(x) for x in B)])

There's probably a way to do this without all the going back and forth from arrays to tuples, but it's not coming to me right now.

edited Nov 29 '11 at 20:50

answered Nov 29 '11 at 20:17

mtrw

34,200
7
63
71

2

I concur. Can't find any 'native' numpy way of doing it. One-line version might be `common = set(tuple(i) for i in A) & set(tuple(i) for i in B)` – Rob Cowie Nov 29 '11 at 20:35
2

If you're going to use set you can use the intersection function: set.intersection( aset, bset) – rhinoinrepose Nov 29 '11 at 23:13
1

@rhinoinrepose - is using the function faster than `&` ? – mtrw Nov 30 '11 at 08:16

Vasilis Lemonidis · Answer 3 · 2017-06-02T12:43:48.977

I could not understand why there is no suggested pure numpy way to get this working. So I found one, that uses numpy broadcast. The basic idea is to transform one of the arrays to 3d by axes swapping. Let's construct 2 arrays:

a=np.random.randint(10, size=(5, 3))
b=np.zeros_like(a)
b[:4,:]=a[np.random.randint(a.shape[0], size=4), :]

With my run it gave:

a=array([[5, 6, 3],
   [8, 1, 0],
   [2, 1, 4],
   [8, 0, 6],
   [6, 7, 6]])
b=array([[2, 1, 4],
   [2, 1, 4],
   [6, 7, 6],
   [5, 6, 3],
   [0, 0, 0]])

The steps are (arrays can be interchanged) :

#a is nxm and b is kxm
c = np.swapaxes(a[:,:,None],1,2)==b #transform a to nx1xm
# c has nxkxm dimensions due to comparison broadcast
# each nxixj slice holds comparison matrix between a[j,:] and b[i,:]
# Decrease dimension to nxk with product:
c = np.prod(c,axis=2)
#To get around duplicates://
# Calculate cumulative sum in k-th dimension
c= c*np.cumsum(c,axis=0)
# compare with 1, so that to get only one 'True' statement by row
c=c==1
#//
# sum in k-th dimension, so that a nx1 vector is produced
c=np.sum(c,axis=1).astype(bool)
# The intersection between a and b is a[c]
result=a[c]

In a function with 2 lines for used memory reduction (correct me if wrong):

def array_row_intersection(a,b):
   tmp=np.prod(np.swapaxes(a[:,:,None],1,2)==b,axis=2)
   return a[np.sum(np.cumsum(tmp,axis=0)*tmp==1,axis=1).astype(bool)]

which gave result for my example:

result=array([[5, 6, 3],
       [2, 1, 4],
       [6, 7, 6]])

This is faster than set solutions, as it makes use only of simple numpy operations, while it reduces constantly dimensions, and is ideal for two big matrices. I guess I might have made mistakes in my comments, as I got the answer by experimentation and instinct. The equivalent for column intersection can either be found by transposing the arrays or by changing the steps a little. Also, if duplicates are wanted, then the steps inside "//" have to be skipped. The function can be edited to return only the boolean array of the indices, which came handy to me ,while trying to get different arrays indices with the same vector. Benchmark for the voted answer and mine (number of elements in each dimension plays role on what to choose):

Code:

def voted_answer(A,B):
    nrows, ncols = A.shape
    dtype={'names':['f{}'.format(i) for i in range(ncols)],
           'formats':ncols * [A.dtype]}
    C = np.intersect1d(A.view(dtype), B.view(dtype))
    return C.view(A.dtype).reshape(-1, ncols)

a_small=np.random.randint(10, size=(10, 10))
b_small=np.zeros_like(a_small)
b_small=a_small[np.random.randint(a_small.shape[0],size=[a_small.shape[0]]),:]
a_big_row=np.random.randint(10, size=(10, 1000))
b_big_row=a_big_row[np.random.randint(a_big_row.shape[0],size=[a_big_row.shape[0]]),:]
a_big_col=np.random.randint(10, size=(1000, 10))
b_big_col=a_big_col[np.random.randint(a_big_col.shape[0],size=[a_big_col.shape[0]]),:]
a_big_all=np.random.randint(10, size=(100,100))
b_big_all=a_big_all[np.random.randint(a_big_all.shape[0],size=[a_big_all.shape[0]]),:]



print 'Small arrays:'
print '\t Voted answer:',timeit.timeit(lambda:voted_answer(a_small,b_small),number=100)/100
print '\t Proposed answer:',timeit.timeit(lambda:array_row_intersection(a_small,b_small),number=100)/100
print 'Big column arrays:'
print '\t Voted answer:',timeit.timeit(lambda:voted_answer(a_big_col,b_big_col),number=100)/100
print '\t Proposed answer:',timeit.timeit(lambda:array_row_intersection(a_big_col,b_big_col),number=100)/100
print 'Big row arrays:'
print '\t Voted answer:',timeit.timeit(lambda:voted_answer(a_big_row,b_big_row),number=100)/100
print '\t Proposed answer:',timeit.timeit(lambda:array_row_intersection(a_big_row,b_big_row),number=100)/100
print 'Big arrays:'
print '\t Voted answer:',timeit.timeit(lambda:voted_answer(a_big_all,b_big_all),number=100)/100
print '\t Proposed answer:',timeit.timeit(lambda:array_row_intersection(a_big_all,b_big_all),number=100)/100

with results:

Small arrays:
     Voted answer: 7.47108459473e-05
     Proposed answer: 2.47001647949e-05
Big column arrays:
     Voted answer: 0.00198730945587
     Proposed answer: 0.0560171294212
Big row arrays:
     Voted answer: 0.00500325918198
     Proposed answer: 0.000308241844177
Big arrays:
     Voted answer: 0.000864889621735
     Proposed answer: 0.00257176160812

Following verdict is that if you have to compare 2 big 2d arrays of 2d points then use voted answer. If you have big matrices in all dimensions, voted answer is the best one by all means. So, it depends on what you choose each time.

Getting ValueError: 'axis' entry is out of bounds for 'c = np.prod(c,axis=2)' line — MajesticRa, Jun 01 '17 at 13:58
That's weird, c is supposed to have 3 dimensions, by issuing the command `c = np.swapaxes(a[:,:,None],1,2)==b' .... — Vasilis Lemonidis, Jun 01 '17 at 15:21
very neat. it is also possible to return indices with a small modification. `np.where(np.prod(np.swapaxes(Array_A[:,:,None],1,2) == Array_B,axis=2).astype(bool))` — bmg, Mar 11 '21 at 00:44

Shubham Sharma · Answer 4 · 2021-04-15T17:30:22.803

10

`Numpy broadcasting`

We can create a boolean mask using broadcasting which can be then used to filter the rows in array A which are also present in array B

A = np.array([[1,4],[2,5],[3,6]])
B = np.array([[1,4],[3,6],[7,8]])

m = (A[:, None] == B).all(-1).any(1)

>>> A[m]

array([[1, 4],
       [3, 6]])

edited Apr 15 '21 at 17:30

answered Apr 15 '21 at 17:18

Shubham Sharma

68,127
6
24
53

score 5 · Answer 5 · answered Nov 30 '11 at 09:13

Another way to achieve this using structured array:

>>> a = np.array([[3, 1, 2], [5, 8, 9], [7, 4, 3]])
>>> b = np.array([[2, 3, 0], [3, 1, 2], [7, 4, 3]])
>>> av = a.view([('', a.dtype)] * a.shape[1]).ravel()
>>> bv = b.view([('', b.dtype)] * b.shape[1]).ravel()
>>> np.intersect1d(av, bv).view(a.dtype).reshape(-1, a.shape[1])
array([[3, 1, 2],
       [7, 4, 3]])

Just for clarity, the structured view looks like this:

>>> a.view([('', a.dtype)] * a.shape[1])
array([[(3, 1, 2)],
       [(5, 8, 9)],
       [(7, 4, 3)]],
       dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])

score 2 · Answer 6 · answered Feb 05 '18 at 18:40

2

np.array(set(map(tuple, b)).difference(set(map(tuple, a))))

This could also work

answered Feb 05 '18 at 18:40

Espoir Murhabazi

5,973
5
42
73

RashidLadj_Winux · Answer 7 · 2020-10-16T15:09:49.683

Without Index Visit https://gist.github.com/RashidLadj/971c7235ce796836853fcf55b4876f3c

def intersect2D(Array_A, Array_B):
"""
Find row intersection between 2D numpy arrays, a and b.
"""

# ''' Using Tuple ''' #
intersectionList = list(set([tuple(x) for x in Array_A for y in Array_B  if(tuple(x) == tuple(y))]))
print ("intersectionList = \n",intersectionList)

# ''' Using Numpy function "array_equal" ''' #
""" This method is valid for an ndarray """
intersectionList = list(set([tuple(x) for x in Array_A for y in Array_B  if(np.array_equal(x, y))]))
print ("intersectionList = \n",intersectionList)

# ''' Using set and bitwise and '''
intersectionList = [list(y) for y in (set([tuple(x) for x in Array_A]) & set([tuple(x) for x in Array_B]))]
print ("intersectionList = \n",intersectionList)

return intersectionList

With Index Visit https://gist.github.com/RashidLadj/bac71f3d3380064de2f9abe0ae43c19e

def intersect2D(Array_A, Array_B):
  """
  Find row intersection between 2D numpy arrays, a and b.
  Returns another numpy array with shared rows and index of items in A & B arrays
  """
  # [[IDX], [IDY], [value]] where Equal
  # ''' Using Tuple ''' #
  IndexEqual = np.asarray([(i, j, x) for i,x in enumerate(Array_A) for j, y in enumerate (Array_B)  if(tuple(x) == tuple(y))]).T
  
  # ''' Using Numpy array_equal ''' #
  IndexEqual = np.asarray([(i, j, x) for i,x in enumerate(Array_A) for j, y in enumerate (Array_B)  if(np.array_equal(x, y))]).T
  
  idx, idy, intersectionList = (IndexEqual[0], IndexEqual[1], IndexEqual[2]) if len(IndexEqual) != 0 else ([], [], [])

  return intersectionList, idx, idy

John Blackmore · Answer 8 · 2019-08-02T02:35:16.703

1

A = np.array([[1,4],[2,5],[3,6]])
B = np.array([[1,4],[3,6],[7,8]])

def matching_rows(A,B):
  matches=[i for i in range(B.shape[0]) if np.any(np.all(A==B[i],axis=1))]
  if len(matches)==0:
    return B[matches]
  return np.unique(B[matches],axis=0)

>>> matching_rows(A,B)
array([[1, 4],
       [3, 6]])

This of course assumes the rows are all the same length.

edited Aug 02 '19 at 02:35

answered Aug 02 '19 at 00:54

John Blackmore

56
4

score 0 · Answer 9 · answered Aug 28 '19 at 18:07

0

import numpy as np

A=np.array([[1, 4],
       [2, 5],
       [3, 6]])

B=np.array([[1, 4],
       [3, 6],
       [7, 8]])

intersetingRows=[(B==irow).all(axis=1).any() for irow in A]
print(A[intersetingRows])

answered Aug 28 '19 at 18:07

SzorgosDiák

181
1
4

Get intersecting rows across two 2D numpy arrays

9 Answers9

`Numpy broadcasting`

Linked

Related