Not sure what you intend to do with the extra dimensions, as the set difference, like any filtering, is inherently losing the shape information.
Anyway, NumPy does provide np.setdiff1d()
to solve this problem elegantly.
EDIT With the clarifications provided, you seems to be looking for a way compute the set difference on a given axis, i.e. the elements of the sets are actually arrays.
There is no built-in specifically for this in NumPy, but it is not too difficult to craft one.
For simplicity, we assume that the operating axis is the first one (so that the element of the set are arr[i]
), that only unique elements appear in the first array, and that the arrays are 2D.
They are all based on the idea that the asymptotically best approach is to build a set()
of the second array and then using that to filter out the entries from the first array.
The idiomatic way to build such set in Python / NumPy is to use:
set(map(tuple, arr))
where the mapping to tuple
freezes arr[i]
, allowing them to be hashable and hence making them available to use with set()
.
Unfortunately, since the filtering would produce results of unpredictable size, NumPy arrays are not the ideal container for the result.
To solve this issue, one can use:
- an intermediate
list
import numpy as np
def setdiff2d_list(arr1, arr2):
delta = set(map(tuple, arr2))
return np.array([x for x in arr1 if tuple(x) not in delta])
np.fromiter()
followed by np.reshape()
import numpy as np
def setdiff2d_iter(arr1, arr2):
delta = set(map(tuple, arr2))
return np.fromiter((x for xs in arr1 if tuple(xs) not in delta for x in xs), dtype=arr1.dtype).reshape(-1, arr1.shape[-1])
- NumPy's advanced indexing
def setdiff2d_idx(arr1, arr2):
delta = set(map(tuple, arr2))
idx = [tuple(x) not in delta for x in arr1]
return arr1[idx]
- Convert both inputs to
set()
(will force uniqueness of the output elements and will lose ordering):
import numpy as np
def setdiff2d_set(arr1, arr2):
set1 = set(map(tuple, arr1))
set2 = set(map(tuple, arr2))
return np.array(list(set1 - set2))
Alternatively, the advanced indexing can be built using broadcasting, np.any()
and np.all()
:
def setdiff2d_bc(arr1, arr2):
idx = (arr1[:, None] != arr2).any(-1).all(1)
return arr1[idx]
Some form of the above methods were originally suggested in @QuangHoang's answer.
A similar approach could also be implemented in Numba, following the same idea as above but using a hash instead of the actual array view arr[i]
(because of the limitations in what is supported inside a set()
by Numba) and pre-computing the output size (for speed):
import numpy as np
import numba as nb
@nb.njit
def mul_xor_hash(arr, init=65537, k=37):
result = init
for x in arr.view(np.uint64):
result = (result * k) ^ x
return result
@nb.njit
def setdiff2d_nb(arr1, arr2):
# : build `delta` set using hashes
delta = {mul_xor_hash(arr2[0])}
for i in range(1, arr2.shape[0]):
delta.add(mul_xor_hash(arr2[i]))
# : compute the size of the result
n = 0
for i in range(arr1.shape[0]):
if mul_xor_hash(arr1[i]) not in delta:
n += 1
# : build the result
result = np.empty((n, arr1.shape[-1]), dtype=arr1.dtype)
j = 0
for i in range(arr1.shape[0]):
if mul_xor_hash(arr1[i]) not in delta:
result[j] = arr1[i]
j += 1
return result
While they all give the same result:
funcs = setdiff2d_iter, setdiff2d_list, setdiff2d_idx, setdiff2d_set, setdiff2d_bc, setdiff2d_nb
arr1 = np.array([[0, 3], [0, 4], [1, 3], [1, 7]])
print(arr1)
# [[0 3]
# [0 4]
# [1 3]
# [1 7]]
arr2 = np.array([[0, 3], [1, 7], [4, 0]])
print(arr2)
# [[0 3]
# [1 7]
# [4 0]]
result = funcs[0](arr1, arr2)
print(result)
# [[0 4]
# [1 3]]
for func in funcs:
print(f'{func.__name__:>24s}', np.all(result == func(arr1, arr2)))
# setdiff2d_iter True
# setdiff2d_list True
# setdiff2d_idx True
# setdiff2d_set False # because of ordering
# setdiff2d_bc True
# setdiff2d_nb True
their performance seems to be varying:
for func in funcs:
print(f'{func.__name__:>24s}', end=' ')
%timeit func(arr1, arr2)
# setdiff2d_iter 16.3 µs ± 719 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# setdiff2d_list 14.9 µs ± 528 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# setdiff2d_idx 17.8 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# setdiff2d_set 17.5 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# setdiff2d_bc 9.45 µs ± 405 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# setdiff2d_nb 1.58 µs ± 51.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The Numba-based approach proposed seems to outperform the other approaches by a fair margin (some 10x using the given input).
Similar timings are observed with larger inputs:
np.random.seed(42)
arr1 = np.random.randint(0, 100, (1000, 2))
arr2 = np.random.randint(0, 100, (1000, 2))
print(setdiff2d_nb(arr1, arr2).shape)
# (736, 2)
for func in funcs:
print(f'{func.__name__:>24s}', end=' ')
%timeit func(arr1, arr2)
# setdiff2d_iter 3.51 ms ± 75.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setdiff2d_list 2.92 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setdiff2d_idx 2.61 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setdiff2d_set 3.52 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setdiff2d_bc 25.6 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setdiff2d_nb 192 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(As a side note, setdiff2d_bc()
is the most negatively affected by the size of the second input).