
Well, I have a simple problem that is giving me a headache. Basically, I have two two-dimensional arrays, full of [x, y] coordinates, and I want to compare the first with the second and generate a third array that contains all the elements of the first array that don't appear in the second. It's simple, but I couldn't make it work at all. The sizes vary a lot: the first array can have between a thousand and 2 million coordinates, while the second has between 1 and a thousand.

This operation will occur many times, and the larger the first array, the more times it will occur.

sample:

arr1 = np.array([[0, 3], [0, 4], [1, 3], [1, 7]])

arr2 = np.array([[0, 3], [1, 7]])

result = np.array([[0, 4], [1, 3]])

In depth: Basically, I have a binary image with variable resolution; it is composed of 0s and 255s, and I analyze each pixel individually (with an algorithm that is already optimized). On purpose, every time this function is executed it analyzes only a fraction of the pixels, and when it finishes it gives me back the coordinates of those pixels. The problem is that every time it executes it runs the following code:

ones = np.argwhere(img == 255) # ones = pixels array

It takes about 0.02 seconds and is by far the slowest part of the code. My idea is to create this variable once and, each time the function ends, remove the already-analyzed pixels and pass the new array as a parameter, continuing until the array is empty, as in the sketch below.
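A rough sketch of the intended flow (analyze_fraction() and set_difference() are placeholder names for my existing analysis function and for the missing operation):

ones = np.argwhere(img == 255)  # computed only once
while ones.size > 0:
    analyzed = analyze_fraction(ones)      # returns the coordinates it processed
    ones = set_difference(ones, analyzed)  # the operation I am asking about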

Pedro Bzz
  • Can you provide a minimal example of what you are trying to achieve? (i.e. sample input and expected output?) – norok2 Mar 17 '21 at 13:56
  • I'm confused, `ones = np.argwhere(img == 255)` doesn't do what you are asking. what is `arr1` and what is `arr2` in that piece of code? That said, I don't think you can optimize `np.argwhere` function by much. – Quang Hoang Mar 17 '21 at 14:41
  • @QuangHoang So, as I explained there, this is the old code, the one I'm using now. What I want to do to improve performance is to call argwhere only once and transform it into arr1; at the end of each function it will return me arr2, and this way I intend to make the new "ones" be the difference between arr1 and arr2, so I don't have to call argwhere several times (this is the slowest part of the code) – Pedro Bzz Mar 17 '21 at 14:57
  • I think this may be an XY problem. Perhaps you should describe your original problem a bit better, and show a more complete example, with some realistic data, especially if you are after a certain performance target. – norok2 Mar 17 '21 at 18:06

2 Answers


Not sure what you intend to do with the extra dimensions, as the set difference, like any filtering, inherently loses the shape information.

Anyway, NumPy does provide np.setdiff1d() to solve this problem elegantly.
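Note, however, that np.setdiff1d() operates on flattened inputs, so it returns unique scalar values rather than coordinate pairs. A minimal illustration on the question's sample data:

import numpy as np

arr1 = np.array([[0, 3], [0, 4], [1, 3], [1, 7]])
arr2 = np.array([[0, 3], [1, 7]])

# setdiff1d() flattens both inputs: it returns the unique values of
# arr1 that are not in arr2, so the row pairing is lost
print(np.setdiff1d(arr1, arr2))
# [4]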


EDIT: With the clarifications provided, you seem to be looking for a way to compute the set difference on a given axis, i.e. where the elements of the sets are themselves arrays.

There is no built-in specifically for this in NumPy, but it is not too difficult to craft one. For simplicity, we assume that the operating axis is the first one (so that the elements of the set are arr[i]), that only unique elements appear in the first array, and that the arrays are 2D.

The approaches below are all based on the idea that the asymptotically best strategy is to build a set() from the second array and then use it to filter out the entries of the first array.

The idiomatic way to build such a set in Python / NumPy is to use:

set(map(tuple, arr))

where the mapping to tuple freezes each arr[i], making it hashable and hence usable with set().
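For example, applied to the question's arr2 this yields:

set(map(tuple, arr2))
# {(0, 3), (1, 7)}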

Unfortunately, since the filtering would produce results of unpredictable size, NumPy arrays are not the ideal container for the result.

To solve this issue, one can use:

  1. An intermediate list:
import numpy as np


def setdiff2d_list(arr1, arr2):
    delta = set(map(tuple, arr2))
    return np.array([x for x in arr1 if tuple(x) not in delta])
  2. np.fromiter() followed by np.reshape():
import numpy as np


def setdiff2d_iter(arr1, arr2):
    delta = set(map(tuple, arr2))
    # flatten the surviving rows, then restore the 2D shape
    gen = (x for xs in arr1 if tuple(xs) not in delta for x in xs)
    return np.fromiter(gen, dtype=arr1.dtype).reshape(-1, arr1.shape[-1])
  3. NumPy's advanced indexing:
def setdiff2d_idx(arr1, arr2):
    delta = set(map(tuple, arr2))
    idx = [tuple(x) not in delta for x in arr1]
    return arr1[idx]
  4. Converting both inputs to set() (this will force uniqueness of the output elements and will lose ordering):
import numpy as np


def setdiff2d_set(arr1, arr2):
    set1 = set(map(tuple, arr1))
    set2 = set(map(tuple, arr2))
    return np.array(list(set1 - set2))

Alternatively, the advanced indexing can be built using broadcasting, np.any() and np.all():

def setdiff2d_bc(arr1, arr2):
    idx = (arr1[:, None] != arr2).any(-1).all(1)
    return arr1[idx]

Some of the above methods were originally suggested in @QuangHoang's answer.

A similar approach can also be implemented in Numba, following the same idea as above but using a hash of each row instead of the actual array view arr[i] (because of limitations in what Numba supports inside a set()) and pre-computing the output size (for speed):

import numpy as np
import numba as nb


@nb.njit
def mul_xor_hash(arr, init=65537, k=37):
    # multiplicative-XOR hash of a 1D row; viewing the row as np.uint64
    # assumes a 64-bit input dtype (e.g. the default np.int64)
    result = init
    for x in arr.view(np.uint64):
        result = (result * k) ^ x
    return result


@nb.njit
def setdiff2d_nb(arr1, arr2):
    # : build `delta` set using hashes (assumes arr2 is non-empty)
    delta = {mul_xor_hash(arr2[0])}
    for i in range(1, arr2.shape[0]):
        delta.add(mul_xor_hash(arr2[i]))
    # : compute the size of the result
    n = 0
    for i in range(arr1.shape[0]):
        if mul_xor_hash(arr1[i]) not in delta:
            n += 1
    # : build the result
    result = np.empty((n, arr1.shape[-1]), dtype=arr1.dtype)
    j = 0
    for i in range(arr1.shape[0]):
        if mul_xor_hash(arr1[i]) not in delta:
            result[j] = arr1[i]
            j += 1
    return result

While they all give the same result (up to ordering, in the case of setdiff2d_set()):

funcs = setdiff2d_iter, setdiff2d_list, setdiff2d_idx, setdiff2d_set, setdiff2d_bc, setdiff2d_nb

arr1 = np.array([[0, 3], [0, 4], [1, 3], [1, 7]])
print(arr1)
# [[0 3]
#  [0 4]
#  [1 3]
#  [1 7]]

arr2 = np.array([[0, 3], [1, 7], [4, 0]])
print(arr2)
# [[0 3]
#  [1 7]
#  [4 0]]

result = funcs[0](arr1, arr2)
print(result)
# [[0 4]
#  [1 3]]

for func in funcs:
    print(f'{func.__name__:>24s}', np.all(result == func(arr1, arr2)))
#           setdiff2d_iter True
#           setdiff2d_list True
#            setdiff2d_idx True
#            setdiff2d_set False  # because of ordering
#             setdiff2d_bc True
#             setdiff2d_nb True

Their performance seems to vary:

for func in funcs:
    print(f'{func.__name__:>24s}', end='  ')
    %timeit func(arr1, arr2)
#           setdiff2d_iter  16.3 µs ± 719 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#           setdiff2d_list  14.9 µs ± 528 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#            setdiff2d_idx  17.8 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#            setdiff2d_set  17.5 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#             setdiff2d_bc  9.45 µs ± 405 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#             setdiff2d_nb  1.58 µs ± 51.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The proposed Numba-based approach seems to outperform the others by a fair margin (some 10x on the given input).

Similar timings are observed with larger inputs:

np.random.seed(42)

arr1 = np.random.randint(0, 100, (1000, 2))
arr2 = np.random.randint(0, 100, (1000, 2))
print(setdiff2d_nb(arr1, arr2).shape)
# (736, 2)


for func in funcs:
    print(f'{func.__name__:>24s}', end='  ')
    %timeit func(arr1, arr2)
#           setdiff2d_iter  3.51 ms ± 75.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#           setdiff2d_list  2.92 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#            setdiff2d_idx  2.61 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#            setdiff2d_set  3.52 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#             setdiff2d_bc  25.6 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#             setdiff2d_nb  192 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(As a side note, setdiff2d_bc() is the most negatively affected by the size of the second input, since the broadcasted comparison materializes an intermediate array of shape (len(arr1), len(arr2), 2).)

norok2
  • @PedroBzz Not sure if it meets your performance requirements, but please have a look at a fast Numba-based implementation, and how it compares with a couple of other approaches. – norok2 Mar 18 '21 at 02:44
  • Note that the collision handling for the hashes should also be implemented. – norok2 Mar 22 '21 at 20:33

It depends on how large your arrays are. If they are not too large (a few thousand), you can:

  1. use broadcasting to compare each point in arr1 to each point in arr2
  2. use any() to check for inequality along the last dimension (the coordinates)
  3. use all() to keep the points of arr1 that differ from every point in arr2

Code:

idx = (arr1[:,None]!=arr2).any(-1).all(1)

arr1[idx]

Output:

array([[0, 4],
       [1, 3]])

Update: for larger data, you can try a set and a list comprehension:

set_arr2 = set(map(tuple, arr2))
idx = [tuple(point) not in set_arr2 for point in arr1]

arr1[idx]
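For the sample arrays this preserves the order of arr1 and gives the same output as above:

# array([[0, 4],
#        [1, 3]])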
Quang Hoang
  • The size varies a lot, from a thousand to 2 million; I'll add that to the original post. – Pedro Bzz Mar 17 '21 at 14:02
  • @PedroBzz see updated answer if it helps. – Quang Hoang Mar 17 '21 at 14:09
  • Thank you! This one works, but it executes in ~0.2 seconds; I was looking for something that executes below 0.02, which is what my current code does. It is the analysis of an image; I will detail it better in the original post. – Pedro Bzz Mar 17 '21 at 14:27
  • @PedroBzz That's pretty impressive performance. Maybe you can/should share your code (which is always advised when posting a question). Also, if you don't care about the ordering of points in `arr1`, you can turn it into a set as well, which will be much faster. – Quang Hoang Mar 17 '21 at 14:28
  • @QuangHoang It [seems](https://stackoverflow.com/a/66674679/5218354) that converting both inputs to `set()` is not so fast after all, given all the conversions it needs to go through. – norok2 Mar 18 '21 at 03:05