I have two large 2-d arrays and I'd like to find their set difference taking their rows as elements. In Matlab, the code for this would be setdiff(A,B,'rows')
. The arrays are large enough that the obvious looping methods I could think of take too long.

- 11,158
- 2
- 34
- 58

- 335
- 2
- 3
- 10
-
What do you mean by "set difference"? – reptilicus Aug 10 '12 at 13:52
-
@user1443118 I'm guessing that he means "values in A that are not in B." as per http://www.mathworks.com/help/techdoc/ref/setdiff.html. – Hooked Aug 10 '12 at 13:55
-
"set difference" as in "set difference" the set theory operation? – Pablo Santa Cruz Aug 10 '12 at 13:56
-
How does you 2-d array look like? a list of lists? – Pablo Santa Cruz Aug 10 '12 at 13:57
-
Are the arrays the same dimensions? – reptilicus Aug 10 '12 at 14:00
-
Not quite what you are looking for, but there is a 1D version (setdiff1d) [http://docs.scipy.org/doc/numpy/reference/generated/numpy.setdiff1d.html] – Hooked Aug 10 '12 at 14:02
-
The 2-d arrays are numpy array objects with the same number of columns but different numbers of rows. – zss Aug 10 '12 at 14:03
3 Answers
This should work, but is currently broken in 1.6.1 due to an unavailable mergesort for the view being created. It works in the pre-release 1.7.0 version. This should be the fastest way possible, since the views don't have to copy any memory:
>>> import numpy as np
>>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
>>> a1_rows = a1.view([('', a1.dtype)] * a1.shape[1])
>>> a2_rows = a2.view([('', a2.dtype)] * a2.shape[1])
>>> np.setdiff1d(a1_rows, a2_rows).view(a1.dtype).reshape(-1, a1.shape[1])
array([[1, 2, 3]])
You can do this in Python, but it might be slow:
>>> import numpy as np
>>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
>>> a1_rows = set(map(tuple, a1))
>>> a2_rows = set(map(tuple, a2))
>>> a1_rows.difference(a2_rows)
set([(1, 2, 3)])

- 64,866
- 22
- 157
- 202
-
Thanks. The bottom method eventually crashed, but once I figure out how to install the new version of numpy I'll try the top method. – zss Aug 11 '12 at 13:56
Here is a nice alternative pure numpy solution that works for 1.6.1. It does create an intermediate array, so this may or may not be a problem for you. It also does not rely on any speedup from a sorted array or not (as setdiff
probably does).
from numpy import *
# Create some sample arrays
A =random.randint(0,5,(10,3))
B =random.randint(0,5,(10,3))
As an example, this is what I got - note that there is one common element:
>>> A
array([[1, 0, 3],
[0, 4, 2],
[0, 3, 4],
[4, 4, 2],
[2, 0, 2],
[4, 0, 0],
[3, 2, 2],
[4, 2, 3],
[0, 2, 1],
[2, 0, 2]])
>>> B
array([[4, 1, 3],
[4, 3, 0],
[0, 3, 3],
[3, 0, 3],
[3, 4, 0],
[3, 2, 3],
[3, 1, 2],
[4, 1, 2],
[0, 4, 2],
[0, 0, 3]])
We look for when the (L1) distance between the rows is zero. This gives us a matrix, which at the points where it is zero, these are the items common to both lists:
idx = where(abs((A[:,newaxis,:] - B)).sum(axis=2)==0)
As a check:
>>> A[idx[0]]
array([[0, 4, 2]])
>>> B[idx[1]]
array([[0, 4, 2]])

- 84,485
- 43
- 192
- 261
-
Can the downvoter explain? I'm welcome to any criticism or comments on how to improve. – Hooked Aug 10 '12 at 15:43
-
Thanks for the clever code (I'll remember the newaxis formulation). Unfortunately, when I tried it I got the error: "ValueError: array is too big." – zss Aug 11 '12 at 14:08
-
@user1590405 When you run `A.size()` and `B.size()` how big are the arrays? – Hooked Aug 11 '12 at 18:51
I'm not sure what you are going for, but this will get you a boolean array of where 2 arrays are not equal, and will be numpy fast:
import numpy as np
a = np.random.randn(5, 5)
b = np.random.randn(5, 5)
a[0,0] = 10.0
b[0,0] = 10.0
a[1,1] = 5.0
b[1,1] = 5.0
c = ~(a-b==0)
print c
[[False True True True True]
[ True False True True True]
[ True True True True True]
[ True True True True True]
[ True True True True True]]

- 10,290
- 6
- 55
- 79
-
1This is not correct, it compares the elements. OP is looking for the set diff of the **rows**. – Hooked Aug 10 '12 at 14:29
-
It's true that "`a[0, c[0]]` gives the elements in the 0 row of a not in b", but the way I read the question was not to find the elements of A and B per row that were identical, but to find the _rows_ of A and the _rows_ of B that matched. – Hooked Aug 10 '12 at 15:38
-
From the match matrix however you can easily go to an array giving the row match using `np.all(match_matrix, axis=0)` however. – Okarin Jul 16 '15 at 17:47