
I have a 3-dimensional array with shape (height, width, 3), representing a BGR image whose values are floats between 0 and 1 (inclusive).

I need to find both the minimum and maximum channel of each pixel in one pass. If img is the array, the result should be the same as np.dstack([img.min(axis=-1), img.max(axis=-1)]), but computed in a single loop.
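For concreteness, here is the reference result on a tiny image (the sizes are arbitrary; the assert is just a shape sanity check):

import numpy as np

img = np.random.random((4, 4, 3))  # tiny BGR float image in [0, 1]

# Per-pixel channel min and max, stacked along a new last axis.
ref = np.dstack([img.min(axis=-1), img.max(axis=-1)])
assert ref.shape == (4, 4, 2)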

I have already implemented this myself and beaten the NumPy built-in methods by a wide margin, but the speed-up comes entirely from Numba. I think there may be even more efficient ways to do this.

My idea is simple: with three numbers a, b, c, there are 6 possible orderings:

a <= b <= c
a <= c <= b
b <= a <= c
b <= c <= a
c <= a <= b
c <= b <= a

With 3 conditional swaps I can find both the minimum and maximum in one function, without calling min and max:

def extrema(a, b, c):
    # A 3-element sorting network: after the three conditional swaps,
    # a holds the minimum and c holds the maximum.
    if b > c:
        b, c = c, b

    if a > b:
        a, b = b, a

    if b > c:
        b, c = c, b

    return a, c
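A quick sanity check that the swap network really returns (min, max):

import random

for _ in range(1000):
    a, b, c = (random.random() for _ in range(3))
    assert extrema(a, b, c) == (min(a, b, c), max(a, b, c))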

Calling my function once is faster than calling min and max separately (one 235 ns call vs. two 186 ns calls):

In [9]: %timeit extrema(29, 500, 12)
235 ns ± 14.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [10]: %timeit max((29, 500, 12))
186 ns ± 7.26 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [11]: %timeit min((29, 500, 12))
186 ns ± 5.74 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

So I have implemented the following functions for benchmarking:

import numba as nb
import numpy as np

def extrema_img(img: np.ndarray) -> np.ndarray:
    # Vectorized version of the swap network using boolean masks.
    # Note: this sorts all three channels, so it returns shape
    # (h, w, 3); channels 0 and 2 hold the min and max.
    ext = img.copy()
    for a, b in ((1, 2), (0, 1), (1, 2)):
        temp = ext[..., a][mask := ext[..., a] > ext[..., b]]
        ext[mask, a] = ext[mask, b]
        ext[mask, b] = temp
    return ext


def extrema_img_1(img: np.ndarray) -> np.ndarray:
    # Plain-Python per-pixel loop applying the swap network.
    (h, w) = img.shape[:2]
    ext = np.zeros((h, w, 2))
    for y in range(h):
        for x in range(w):
            a, b, c = img[y, x]
            if b > c:
                b, c = c, b

            if a > b:
                a, b = b, a

            if b > c:
                b, c = c, b

            ext[y, x] = (a, c)
    return ext


@nb.njit(cache=True, fastmath=True)
def extrema_img_nb1(img: np.ndarray) -> np.ndarray:
    # The same per-pixel loop, JIT-compiled with Numba.
    (h, w) = img.shape[:2]
    ext = np.zeros((h, w, 2))
    for y in range(h):
        for x in range(w):
            a, b, c = img[y, x]
            if b > c:
                b, c = c, b

            if a > b:
                a, b = b, a

            if b > c:
                b, c = c, b

            ext[y, x] = (a, c)
    return ext


@nb.njit(cache=True, fastmath=True)
def extrema_img_nb3(img: np.ndarray) -> np.ndarray:
    # Numba loop that fully sorts each pixel and keeps elements 0 and 2.
    (h, w) = img.shape[:2]
    ext = np.zeros((h, w, 2))
    for y in range(h):
        for x in range(w):
            ext[y, x] = np.sort(img[y, x])[:3:2]
    return ext


def extrema_img_7(img: np.ndarray) -> np.ndarray:
    # Sorts every pixel's channels; note this returns shape (h, w, 3),
    # with the min in channel 0 and the max in channel 2.
    return np.sort(img, axis=-1)


def extrema_img_8(img: np.ndarray) -> tuple:
    # Two separate NumPy reductions; returns a tuple of two (h, w) arrays.
    return img.min(axis=-1), img.max(axis=-1)
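For completeness, another vectorized candidate along the same lines (a sketch I have not included in the timings below; extrema_img_9 uses pairwise np.minimum/np.maximum calls instead of axis reductions):

def extrema_img_9(img: np.ndarray) -> np.ndarray:
    # Pairwise ufunc calls; returns the same (h, w, 2) layout as the loops.
    b, g, r = img[..., 0], img[..., 1], img[..., 2]
    return np.dstack([np.minimum(np.minimum(b, g), r),
                      np.maximum(np.maximum(b, g), r)])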

With img = np.random.random((1080, 1920, 3)), the benchmark results are below:

In [2]: %timeit extrema_img(img)
338 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit extrema_img_1(img)
3.88 s ± 55.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit extrema_img_nb1(img)
18.6 ms ± 788 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit extrema_img_nb1(img)
19.6 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit extrema_img_nb3(img)
1.67 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit extrema_img_7(img)
89.3 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit extrema_img_8(img)
184 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

extrema_img_nb1 is the fastest, but extrema_img_1 (the pure-Python version of the same loop) is much slower than the boolean-masking solution. I have tried to compile the masking-based solution with Numba, but it doesn't work; I may post a separate question about that.
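One direction I have not benchmarked yet is parallelizing the Numba loop over rows; a sketch (untested, same swap network as extrema_img_nb1 but with parallel=True and nb.prange):

@nb.njit(cache=True, fastmath=True, parallel=True)
def extrema_img_nb1_par(img: np.ndarray) -> np.ndarray:
    # Rows are processed by multiple threads via nb.prange.
    (h, w) = img.shape[:2]
    ext = np.zeros((h, w, 2))
    for y in nb.prange(h):
        for x in range(w):
            a, b, c = img[y, x]
            if b > c:
                b, c = c, b
            if a > b:
                a, b = b, a
            if b > c:
                b, c = c, b
            ext[y, x] = (a, c)
    return ext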

How can it be made faster?

  • I've removed the [tag:python-3.x] tag. Please stop using that tag on your posts that aren't explicitly about python-3.x. – jared Sep 02 '23 at 13:32
  • @jared The Walrus operator (`:=`) is a feature exclusive to Python 3.x. – Ξένη Γήινος Sep 02 '23 at 13:33
  • That doesn't make this a python-3.x question. Just because a question uses `range` instead of `xrange` doesn't mean it should be tagged python-3.x. – jared Sep 02 '23 at 13:35
  • The methods don't produce the same data - they don't even produce the same _array shape_. So this is not at all an apples-to-apples comparison. – Reinderien Sep 02 '23 at 20:34
  • Regarding the compilation error, AFAIK Numba supports neither `...` nor this kind of masking. Masking can often be replaced with non-masking operations. The masking strategy is generally inefficient because of [branch prediction](https://stackoverflow.com/a/11227902/12939557) (AFAIK NumPy does not use a branchless approach so far, as opposed to `np.where`, which I optimized). However, sometimes it is the best strategy available with NumPy. – Jérôme Richard Sep 02 '23 at 20:59
  • Do you have a GPU? On my machine, if I use PyTorch with a GPU instead of NumPy, `extrema_img_8` is about 2x faster than `extrema_img_nb1`. – jared Sep 02 '23 at 20:59
  • Besides, please note that I added a comment in [the answer to your previous question](https://stackoverflow.com/a/77025874/12939557) explaining why NumPy is not efficient with such an input layout and why the Numba code can be far faster with a better layout. It is worth reading, since SIMD can outperform all the micro-optimizations you can think of by a wide margin. – Jérôme Richard Sep 02 '23 at 21:00
  • @jared A 2x speed-up is actually quite disappointing for GPU code (i.e. parallel) compared to a sequential CPU code that could be parallelized and that does not currently benefit from SIMD instructions. In fact, discrete GPUs cannot be faster than CPUs for most basic image operations, which are *memory-bound*. This is because the PCIe link used for transferring data is significantly slower than the host RAM. Not to mention that allocating GPU data, starting a kernel, and synchronizing are quite slow. Such a computation is known to be memory-bound with SIMD instructions on most machines. – Jérôme Richard Sep 02 '23 at 21:06

0 Answers