This is probably not the most efficient approach in general (though it turned out to be faster than the other approaches presented here for this input; see the timings below), but one thing you can do is convert the arrays a and b to Python lists of tuples and then take their set difference:
# Method 1
tmp_1 = [tuple(i) for i in a] # -> [(1, 2), (1, 3), (1, 4)]
tmp_2 = [tuple(i) for i in b] # -> [(1, 2), (1, 3)]
c = np.array(list(set(tmp_1).difference(tmp_2)))
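For reference, here is a self-contained sketch of Method 1. The values of a and b are assumed from the comments in the snippet above. The final sort is not part of the original snippet; it is added here because a set has no defined order, so the rows of c could otherwise come back in any order. Note also that duplicate rows in a collapse into one.

```python
import numpy as np

# Assumed inputs, inferred from the comments in Method 1
a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

# Rows must be hashable to go into a set, hence the tuples
rows_a = {tuple(row) for row in a}
rows_b = {tuple(row) for row in b}

# sorted() makes the row order deterministic
c = np.array(sorted(rows_a - rows_b))
print(c)  # [[1 4]]
```

If every row of a also appears in b, the result is an empty one-dimensional array rather than an array of shape (0, 2), which may matter for downstream code.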
As noted by @EmiOB, this post offers some insights into why [d for d in a if d not in b] in your question does not work. Drawing from that post, you can use:
# Method 2
c = np.array([d for d in a if all(any(d != i) for i in b)])
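To see what the condition in Method 2 is testing, it may help to spell it out. A row d is kept when, for every row i of b, at least one element differs. The sketch below expresses the same test in an equivalent vectorized form; row_in is a hypothetical helper name introduced only for illustration, and a and b are assumed to be the arrays from the question.

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])  # assumed inputs
b = np.array([[1, 2], [1, 3]])

def row_in(d, rows):
    """Return True if d equals some row of `rows` element for element."""
    # (rows == d) broadcasts d against every row; .all(axis=1) requires
    # every element of a row to match; .any() asks for at least one such row
    return bool((rows == d).all(axis=1).any())

c = np.array([d for d in a if not row_in(d, b)])
print(c)  # [[1 4]]
```

By De Morgan's laws, not row_in(d, b) is the same condition as all(any(d != i) for i in b).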
Remarks
The implementation of array_contains(PyArrayObject *self, PyObject *el) (in C) says that calling array_contains(self, el) (in C) is equivalent to (self == el).any() in Python, where self is a pointer to an array and el is a pointer to a Python object.
In other words:
- if arr is a numpy array and obj is some arbitrary Python object, then obj in arr is the same as (arr == obj).any()
- if arr is a typical Python container such as a list, tuple, dictionary, and so on, then obj in arr is the same as any(obj is _ or obj == _ for _ in arr) (see membership test operations).
All of which is to say, the meaning of obj in arr depends on the type of arr.
This explains why the list comprehension that you proposed, [d for d in a if d not in b], does not have the desired effect: d in b is True as soon as any single element of d matches the element in the same column of some row of b, even when no row of b equals d as a whole.
This can be confusing because it is tempting to reason that, since a numpy array is a sequence (though not a standard Python one), the membership test semantics should be the same. This is not the case.
Example:
a = np.array([[1,2],[1,3],[1,4]])
print((a == [1,2]).any()) # same as [1, 2] in a
# outputs True
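The example above uses a row that really is in a, so it does not show the pitfall by itself. A sketch with a row that is not in a makes the difference visible. The value [1, 5] and the name probe are chosen here purely for illustration.

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
probe = [1, 5]  # not a row of a, but its first element matches column 0

print(probe in a)                      # True  (numpy semantics)
print((a == probe).any())              # True  (what numpy evaluates)
print(probe in a.tolist())             # False (list semantics)
print((a == probe).all(axis=1).any())  # False (whole-row comparison)
```

The last line is one way to ask the question most people mean by "is this row in the array".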
Timings
For your input, Method 1 was the fastest, followed by Method 2 (obtained from the post @EmiOB suggested), followed by @DanielF's approach. I would not be surprised if a different input size changed the ordering, so take these timings with a grain of salt.
# Method 1
5.96 µs ± 8.92 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# Method 2
6.45 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# @DanielF's answer
16.5 µs ± 276 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
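If you want to repeat the comparison on your own input, a sketch along these lines should work. It uses the standard-library timeit module rather than IPython's %timeit. The function names method_1 and method_2 and the repeat counts are assumptions made for illustration, and @DanielF's code is not reproduced here. Absolute numbers will vary by machine and numpy version.

```python
import timeit
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])  # assumed inputs
b = np.array([[1, 2], [1, 3]])

def method_1(a, b):
    # set difference on rows converted to hashable tuples
    return np.array(list(set(map(tuple, a)).difference(map(tuple, b))))

def method_2(a, b):
    # keep d only if it differs from every row of b in at least one element
    return np.array([d for d in a if all(any(d != i) for i in b)])

calls = 1_000
for fn in (method_1, method_2):
    # best of 5 repeats, reported per call in microseconds
    best = min(timeit.repeat(lambda: fn(a, b), number=calls, repeat=5))
    print(f"{fn.__name__}: {best / calls * 1e6:.2f} µs per call")
```

To check how the ordering changes with size, replace a and b with larger arrays before running the loop.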