Is there a way to take the set difference of numpy arrays containing tuples?

Question

I want to get the set difference of arrays of tuples. For instance, for

import numpy as np

a = np.empty((2,), dtype=object)
a[0] = (0, 1)
a[1] = (2, 3)

b = np.empty((1,), dtype=object)
b[0] = (2, 3)

I would like to have a function, say set_diff, such that set_diff(a, b) = array([(0, 1)], dtype=object). The function np.setdiff1d doesn't work: np.setdiff1d(a, b) yields array([(0, 1), (2, 3)], dtype=object).

Is there a function that does the job, or a way to make np.setdiff1d have the desired behavior?

hpaulj · Answer 1 · 2021-02-14T21:31:47.093

Why not use Python sets?

In [339]: a = [(0,1),(2,3)]; b = [(2,3)]
In [340]: set(a).difference(b)
Out[340]: {(0, 1)}

Object dtype arrays don't gain much, if anything, over lists.

numpy is primarily a numeric array package. Object dtype arrays are something of an afterthought, and are more list-like. Math on objects is hit-or-miss, and things like this that involve sorting and equality tests often don't work, or work in unpredictable ways. Don't give up on regular Python types like lists, tuples and sets just because someone told you numpy is faster!
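
If the result still has to be a 1-D object array, the set-based idea above can be wrapped in a small helper. This is only a sketch: set_diff is the hypothetical name from the question, not a numpy function, and keeping the order of a and dropping duplicates are assumptions about the desired behavior.

```python
import numpy as np

def set_diff(a, b):
    """Elements of `a` that are not in `b`, as a 1-D object array.

    Hypothetical helper (not part of numpy). Keeps the order of `a`
    and drops duplicates; assumes every element is hashable.
    """
    exclude = set(b)   # tuples are hashable, so this works
    seen = set()
    kept = []
    for item in a:
        if item not in exclude and item not in seen:
            seen.add(item)
            kept.append(item)
    out = np.empty(len(kept), dtype=object)
    for i, item in enumerate(kept):
        out[i] = item  # element-wise assignment keeps each tuple intact
    return out

a = np.empty((2,), dtype=object)
a[0] = (0, 1)
a[1] = (2, 3)

b = np.empty((1,), dtype=object)
b[0] = (2, 3)

result = set_diff(a, b)
print(result)
```

The element-wise assignment at the end is deliberate: it avoids numpy interpreting a list of tuples as a 2-D array.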

atru · Answer 2 · 2021-02-14T21:34:54.087

The problem you're facing is actually documented in a number of Stack Overflow posts. It turns out that this specific function, np.setdiff1d, does not operate on arrays of tuples, and seemingly not on any objects aside from scalars. This post provides a simple demonstration, and this one gives some more details on the issue.

Both posts have good fixes for related problems. For your code, if computationally permissible, I would move back to Python sets and lists:

import numpy as np

a = np.empty((3,), dtype=object)
a[0] = (0, 1)
a[1] = (2, 3)
a[2] = (0, 3)

b = np.empty((2,), dtype=object)
b[0] = (2, 3)
b[1] = (3, 4)

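# tuples are hashable, so the array elements can go straight into sets
set_a = set(a)
set_b = set(b)
# elements of a that are not in b; convert the set to a list before building the array
output = np.array(list(set_a-set_b))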

print(set_a)
print(set_b)
print(output)

Here the code ends up turning the set difference back into a NumPy array, but of course that is optional. Also, I extended your example a little to demonstrate the dimensions of the final array and, more importantly, the fact that it only contains the elements of a that are not in b, and not vice versa.
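
On the dimensions: passing a list of equal-length tuples to np.array gives a 2-D numeric array, one row per tuple, rather than a 1-D array of tuples like the inputs. A small sketch of the difference follows; the variable names are illustrative, and sorted() is used only to make the order deterministic.

```python
import numpy as np

diff = {(0, 1), (0, 3)}  # e.g. the result of set_a - set_b above

# np.array treats each tuple as a row: shape (2, 2), integer dtype
as_2d = np.array(sorted(diff))

# To keep a 1-D array whose elements are tuples, fill an object array
as_obj = np.empty(len(diff), dtype=object)
for i, t in enumerate(sorted(diff)):
    as_obj[i] = t

print(as_2d.shape, as_obj.shape)
```

Which form is preferable depends on what happens next: the 2-D array is better for numeric work, while the object array matches the layout used in the question.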

edited Feb 14 '21 at 21:34

answered Feb 14 '21 at 20:54


`a` and `b` were created as 1d, so there's no value to adding the `flatten`. I don't think the `list()` adds anything either. `set(a)` should suffice. – hpaulj Feb 14 '21 at 21:14
Indeed, thank you @hpaulj - I rarely use Numpy and got distracted altogether :) yes, it is 1D and no need for intermediate conversion to a list. – atru Feb 14 '21 at 21:36

The `numpy` "set" functions depend on sorting and equality testing. That doesn't work with non-numeric values. The python 'set' works with hashing (like the python 'dict'), so works with immutables like tuples. – hpaulj Feb 15 '21 at 00:41
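
A minimal illustration of the hashing point, using plain Python only. It also shows the limit of the set approach: it would fail if the elements were lists rather than tuples, since lists are not hashable.

```python
pairs = {(0, 1), (2, 3)}

# Membership is decided by hash plus equality, with no sorting involved
found = (2, 3) in pairs
print(found)

# Mutable containers such as lists are unhashable and cannot be set elements
try:
    {[0, 1]}
    list_is_hashable = True
except TypeError:
    list_is_hashable = False
print(list_is_hashable)
```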
