
When using numba and accessing elements in multiple 2d numpy arrays, is it better to use the index or to iterate over the arrays directly? I'm finding that a combination of the two is the fastest, which seems counterintuitive to me. Or is there another, better way to do it?

For context, I am trying to speed up the implementation of the raytracing approach in this paper https://iopscience.iop.org/article/10.1088/1361-6560/ac1f38/pdf.

I have a function which takes the intensity before propagation and the displacement maps that result from the propagation. The resulting intensity is then the original intensity displaced by the displacement maps, pixel by pixel, with sub-pixel displacements shared proportionately between the respective adjacent pixels. On a side note, can this be implemented directly in numpy or in another library? I've noticed it is similar to opencv's remap function; a rough pure-numpy sketch of what I mean is included after the function below.

import numpy as np
from numba import njit

@njit
def raytrace_range(intensity_0, d_y, d_x):
    """

    Args:

        intensity_0 (2d numpy array): intensity before propagation
        d_y (2d numpy array): Displacement along y in pixels
        d_x (2d numpy array): Displacement along x in pixels

    Returns:
        intensity_z (2d numpy array): intensity after propagation 

    """
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i in range(n_x):
        for j in range(n_y):
            i_ij = intensity_0[i, j]
            dx_ij=d_x[i,j]
            dy_ij=d_y[i,j]


            # Always the same from here down
            if not dx_ij and not dy_ij:
                intensity_z[i,j]+=i_ij
                continue
            i_new=i
            j_new=j
            #Calculating displacement bigger than a pixel
            if np.abs(dx_ij)>1:
                x = np.floor(dx_ij)
                i_new=int(i+x)
                dx_ij=dx_ij-x
            if np.abs(dy_ij)>1:
                y = np.floor(dy_ij)
                j_new=int(j+y)
                dy_ij=dy_ij-y
            # Calculating sub-pixel displacement
            if 0<=i_new and i_new<n_y and 0<=j_new and j_new<n_x:
                intensity_z[i_new,j_new]+=i_ij*(1-np.abs(dx_ij))*(1-np.abs(dy_ij))
                if i_new<n_y-1 and dx_ij>=0:
                    if j_new<n_y-1 and dy_ij>=0:
                        intensity_z[i_new+1, j_new]+=i_ij*dx_ij*(1-dy_ij)
                        intensity_z[i_new+1, j_new+1]+=i_ij*dx_ij*dy_ij
                        intensity_z[i_new, j_new+1]+=i_ij*(1-dx_ij)*dy_ij
                    if j_new and dy_ij<0:
                        intensity_z[i_new+1,j_new]+=i_ij*dx_ij*(1-np.abs(dy_ij))
                        intensity_z[i_new+1,j_new-1]+=i_ij*dx_ij*np.abs(dy_ij)
                        intensity_z[i_new,j_new-1]+=i_ij*(1-dx_ij)*np.abs(dy_ij)
                if i_new and dx_ij<0:
                    if j_new<n_x-1 and dy_ij>=0:
                        intensity_z[i_new-1,j_new]+=i_ij*np.abs(dx_ij)*(1-dy_ij)
                        intensity_z[i_new-1,j_new+1]+=i_ij*np.abs(dx_ij)*dy_ij
                        intensity_z[i_new,j_new+1]+=i_ij*(1-np.abs(dx_ij))*dy_ij
                    if j_new and dy_ij<0:
                        intensity_z[i_new-1,j_new]+=i_ij*np.abs(dx_ij)*(1-np.abs(dy_ij))
                        intensity_z[i_new-1,j_new-1]+=i_ij*dx_ij*dy_ij
                        intensity_z[i_new,j_new-1]+=i_ij*(1-np.abs(dx_ij))*np.abs(dy_ij)
    return intensity_z
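
As mentioned above, here is a rough pure-numpy sketch of the same operation (a bilinear forward splat using np.add.at). It uses plain floor-based weighting and simply drops out-of-bounds contributions, so it is not guaranteed to match the numba function above exactly, but it shows the remap-like route I have in mind:

def raytrace_numpy(intensity_0, d_y, d_x):
    # Sketch of a vectorised forward warp: each pixel's intensity is split
    # bilinearly between the four pixels surrounding its displaced position.
    n_y, n_x = intensity_0.shape
    ii, jj = np.meshgrid(np.arange(n_y), np.arange(n_x), indexing="ij")
    # Same convention as raytrace_range above: d_x displaces the first index
    # and d_y the second one.
    y = ii + d_x
    x = jj + d_y
    y0 = np.floor(y).astype(np.int64)
    x0 = np.floor(x).astype(np.int64)
    wy = y - y0  # fractional parts in [0, 1)
    wx = x - x0
    out = np.zeros_like(intensity_0, dtype=np.float64)
    corners = [(0, 0, (1 - wy) * (1 - wx)),
               (1, 0, wy * (1 - wx)),
               (0, 1, (1 - wy) * wx),
               (1, 1, wy * wx)]
    for oy, ox, w in corners:
        yy, xx = y0 + oy, x0 + ox
        inside = (yy >= 0) & (yy < n_y) & (xx >= 0) & (xx < n_x)
        np.add.at(out, (yy[inside], xx[inside]), (intensity_0 * w)[inside])
    return out

np.add.at is unbuffered (so repeated target indices accumulate correctly) but it is quite slow, which is why I'm focusing on the numba implementations below.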

I've tried a few other approaches, of which this is the fastest (each includes the code from above after the comment # Always the same from here down, which I've omitted to keep the question relatively short):

@njit
def raytrace_enumerate(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, i_i in enumerate(intensity_0):
        for j, i_ij in enumerate(i_i):
            dx_ij=d_x[i,j]
            dy_ij=d_y[i,j]
@njit
def raytrace_npndenumerate(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for (i, j), i_ij  in np.ndenumerate(intensity_0):
            dx_ij=d_x[i,j]
            dy_ij=d_y[i,j]
@njit
def raytrace_zip(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, (i_i, dy_i, dx_i) in enumerate(zip(intensity_0, d_y, d_x)):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(zip(i_i, dy_i, dx_i)):
@njit
def raytrace_stack1(idydx):
    n_y, _, n_x = idydx.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, (i_i, dy_i, dx_i) in enumerate(idydx):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(zip(i_i, dy_i, dx_i)):
@njit
def raytrace_stack2(idydx):
    n_y, n_x, _ = idydx.shape
    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)
    for i, k in enumerate(idydx):
        for j, (i_ij, dy_ij, dx_ij) in enumerate(k):

Make up some test data and time:

import timeit
rng = np.random.default_rng()
size = (2010, 2000)
margin = 10
test_data = np.pad(10000*rng.random(size=size), margin)
dx = np.pad(10*(rng.random(size=size)-0.5), margin)
dy = np.pad(10*(rng.random(size=size)-0.5), margin)

# Check results are the same
L = [
    raytrace_range(test_data, dy, dx),
    raytrace_enumerate(test_data, dy, dx),
    raytrace_npndenumerate(test_data, dy, dx),
    raytrace_zip(test_data, dy, dx),
    raytrace_stack1(np.stack([test_data, dy, dx], axis=1)),
    raytrace_stack2(np.stack([test_data, dy, dx], axis=2)),
]
print((np.diff(np.vstack(L).reshape(len(L),-1),axis=0)==0).all())

%timeit raytrace_range(test_data, dy, dx)
%timeit raytrace_enumerate(test_data, dy, dx)
%timeit raytrace_npndenumerate(test_data, dy, dx)
%timeit raytrace_zip(test_data, dy, dx)
%timeit raytrace_stack1(np.stack([test_data, dy, dx], axis=1)) #Note this would be the fastest if the arrays were pre-stacked
%timeit raytrace_stack2(np.stack([test_data, dy, dx], axis=2))

Output:

True
40.4 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
37.5 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.8 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
38.6 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
42 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) #Note this would be the fastest if the arrays were pre-stacked
47.4 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Nin17
  • Upvoting the question. It is not common to see well-elaborated questions. Some time ago numba could benefit from declaring variable types; I am not sure if that is still the case. – Zaero Divide Jun 01 '22 at 15:34
  • BTW it seems you are filling the array with zeros but you overwrite them later. If that is the case, use `np.empty` or `np.empty_like`; it is faster than `np.zeros`. – Zaero Divide Jun 01 '22 at 15:38
  • @ZaeroDivide do you mean with type hinting or assert statements? Also, for ```np.empty```, it only gives the same result if you set all elements of the array, which I'm not guaranteed to be doing, and it's only noticeably faster on macOS: https://stackoverflow.com/questions/72449147/speed-up-the-initialization-of-3d-matrices-in-numpy/72449325?noredirect=1#comment128000789_72449325 – Nin17 Jun 01 '22 at 16:12
  • @ZaeroDivide For the types, it is interesting to provide the signature to Numba so it can compile the function eagerly instead of lazily. Once the function is compiled, this has no impact. The Numba JIT (LLVM-Lite) does not need more information thanks to the type inference (a small sketch of such a signature follows this comment thread). – Jérôme Richard Jun 01 '22 at 16:28
  • @ZaeroDivide For `np.empty` VS `np.zeros` it is a bit complex in practice though `np.empty` should never be slower than `np.zeros`. See this [recent answer](https://stackoverflow.com/questions/72449147/speed-up-the-initialization-of-3d-matrices-in-numpy/72452448#72452448) which explain why. – Jérôme Richard Jun 01 '22 at 16:29
  • Thanks @Nin17 and @JérômeRichard. Indeed, in my tests `empty` was always slightly faster than `zeros`. My confusion about data types was probably both about compilation and because I was mixing it up with Cython... my fault. It works when pre-compiling the code and keeping it as a module. – Zaero Divide Jun 01 '22 at 17:20
  • Just out of curiosity, I tried to cythonize the code. With no Python variables (no yellow in the output report) it is **outstandingly slower than numba, at 100 ms**. I did the whole shebang: cdefs, memoryviews, disabled bounds checking... I trimmed the code to see where it starts slowing down. Just adding `intensity_z[i_new,j_new] += i_ij*(1-abs(dx_ij))*(1-abs(dy_ij))` before the `if` already slows the code to 80 ms from 13 µs. – Zaero Divide Jun 01 '22 at 20:51
  • Actually, it doesn't matter if it is before or after the first if, and I bypass the rest with `continue`. The result is pretty much the same 80ish ms – Zaero Divide Jun 01 '22 at 21:00
  • @ZaeroDivide This is certainly because of flags or the chosen compiler. Please read [Why is numba so fast?](https://stackoverflow.com/questions/70297011/why-is-numba-so-fast/70297999#70297999) – Jérôme Richard Jun 01 '22 at 22:10
  • @JérômeRichard indeed. I blame it on vectorisation in numba. AFAIK I have the compiler flags already optimized in Cython. I am not sure my processor has AVX512. I will have to check tomorrow! – Zaero Divide Jun 02 '22 at 00:20
  • @JérômeRichard I was able to optimize the compilation in cython. I am using MSVC\14.29.30037, so the flags are different. Indeed, my processor does not have AVX512 but enabling AVX2 it made a huge difference in my tests. – Zaero Divide Jun 03 '22 at 19:46
  • @ZaeroDivide Ok. Note that MSVC is not great for auto-vectorization but it is the default supported compiler for Cython on Windows. GCC can be used via a custom (experimental) setup. As for AVX512, most desktops do not have this instruction set (mainly Intel IceLake/TigerLake/CannonLake processors so far), but most Intel server processors already use it (Skylake-X). AMD does not implement AVX512 yet (Zen 4, planned for release at the end of the year, should support it). – Jérôme Richard Jun 04 '22 at 00:22
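
For reference, here is a minimal sketch of the two suggestions from the comments above: an explicit argument signature so Numba compiles the function eagerly, and np.empty_like for the output buffer (only safe when every element is written before being read). The function and the C-contiguous float64 layout are illustrative assumptions, not code from the post:

import numpy as np
from numba import njit

# Explicit argument types make Numba compile at decoration time instead of
# lazily on the first call; the layout (C-contiguous float64) is an assumption
# about the real data.
@njit('(float64[:, ::1], float64[:, ::1], float64[:, ::1])')
def sum_maps(intensity_0, d_y, d_x):
    # np.empty_like skips the zero-fill of np.zeros; that is only safe here
    # because every element of out is written below.
    out = np.empty_like(intensity_0)
    for i in range(intensity_0.shape[0]):
        for j in range(intensity_0.shape[1]):
            out[i, j] = intensity_0[i, j] + d_y[i, j] + d_x[i, j]
    return out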

2 Answers


In general, the fastest way to iterate over an array is a basic low-level loop over integer indices. Such a pattern causes the minimum number of transformations in Numba, so the compiler should be able to optimize the code pretty well. Functions like zip and enumerate often add an additional overhead due to indirect code transformations that are not perfectly optimized out.

Here is a basic example:

import numba as nb

@nb.njit('(int_[::1],)')
def test(arr):
    s1 = s2 = 0
    for i in range(arr.shape[0]):
        s1 += i
        s2 += arr[i]
    return (s1, s2)

arr = np.arange(200_000)
test(arr)

However, things are more complex when you read/write to multiple arrays simultaneously (which is your case). Indeed, Numpy arrays can be indexed with negative indices, so Numba needs to perform bound checking every time. This check is expensive compared to the actual access, and it can even break some other optimizations (e.g. vectorization).

Consequently, Numba has been optimized so as to analyse the code, detect cases where bound checking is not needed, and avoid adding expensive checks at runtime. This is the case in the above code but not in your raytrace_range function. enumerate and enumerate+zip can help a lot to remove bound checking because Numba can easily prove that the index lies within the bounds of the array (theoretically it could prove this for raytrace_range too, but the current implementation is unfortunately not smart enough). You can mostly solve this problem using assertions. They are not only good for optimization but also make your code more robust!
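
For illustration, here is a toy example of the assertion idea (not from the original code); the assertion gives the compiler an extra fact, equal shapes, that it can use to simplify the per-access checks on the second array:

import numpy as np
from numba import njit

@njit
def weighted_sum(a, w):
    # The assertion mirrors the trick described above: it tells the compiler
    # that w is as long as a, so indexing w[i] inside the loop needs no extra
    # handling.
    assert a.shape == w.shape
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i] * w[i]
    return s

weighted_sum(np.random.rand(1_000_000), np.random.rand(1_000_000))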

Moreover, the indexing of multidimensional arrays is sometimes not perfectly optimized by the underlying JIT (LLVM-Lite). There is no reason for it not to be optimized, but compilers use heuristics that are far from perfect (though pretty good on average). You can help by computing views of lines (rows). This generally results in a tiny improvement, though.

Here is the improved code:

@njit
def raytrace_range_opt(intensity_0, d_y, d_x):
    n_y, n_x = intensity_0.shape
    assert intensity_0.shape == d_y.shape
    assert intensity_0.shape == d_x.shape

    intensity_z = np.zeros((n_y, n_x), dtype=np.float64)

    for i in range(n_x):
        row_intensity_0 = intensity_0[i, :]
        row_d_x = d_x[i, :]
        row_d_y = d_y[i, :]

        for j in range(n_y):
            assert j >= 0  # Crazy optimization (see later)

            i_ij = row_intensity_0[j]
            dx_ij = row_d_x[j]
            dy_ij = row_d_y[j]

            # Always the same from here down
            if not dx_ij and not dy_ij:
                intensity_z[i, j] += i_ij
                continue

            # Remaining code left unmodified

Notes

Note that I think the indexing in the function raytrace_range is bogus: it should be for i in range(n_y): and for j in range(n_x): instead, since the accesses are done with intensity_0[i, j] and you wrote n_y, n_x = intensity_0.shape. Note that swapping the axes also gives correct results according to your validation function (which is suspicious).

The assert j >= 0 instruction alone results in an 8% speed-up, which is crazy since the integer iterator j is guaranteed to be non-negative: it iterates over a range whose bound comes from a shape, and shapes are never negative! This is clearly a missed optimization in Numba that LLVM-Lite cannot recover (since LLVM-Lite does not know what a shape is, nor that shapes are always non-negative). This apparently missing assumption in the Numba-generated code causes additional bound checking (on each of the three arrays) that is pretty expensive.


Benchmark

Here are results on my machine:

raytrace_range:           47.8 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_enumerate:       38.9 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_npndenumerate:   54.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_zip:             41 ms ± 657 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_stack1:          86.7 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
raytrace_stack2:          84 ms ± 432 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

raytrace_range_opt:       38.6 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As you can see raytrace_range_opt is the fastest implementation on my machine.

Jérôme Richard
  • is this not the same as my original one ```raytrace_range``` (except with ```'(int_[::1],)'```) which is outperformed by both enumerate and zip? – Nin17 Jun 01 '22 at 16:28
  • You are right. It is similar to `raytrace_range`. My answer was quite generic; I added important information about effects that are happening in your specific case and provided an optimized version still based on direct indexing. – Jérôme Richard Jun 01 '22 at 19:24
  • From my checks I found that when removing the (seemingly) "superfluous" `if` conditions, the code behaves more as expected: `range` is faster than `enumerate`. – Zaero Divide Jun 01 '22 at 19:39
  • @ZaeroDivide `if not dx_ij and not dy_ij` is data-dependent so while it may seem to be useless here, it can be useful for the OP on non-random data. I also think the rest of the code can be optimized but the OP focused on the iteration, so I did ;) . – Jérôme Richard Jun 01 '22 at 19:44
  • I meant that (unless I missed something) the condition is "premature optimization", as it will occur naturally in the assignment `intensity_z[i_new,j_new]+=i_ij*(1-np.abs(dx_ij))*(1-np.abs(dy_ij))` when `dx_ij=0` and `dy_ij=0`. Note that I am assuming `np.abs(d{X|Y}_ij)>1` is also removed, as it is an identity function when they are below 1 – Zaero Divide Jun 01 '22 at 19:47
  • @ZaeroDivide I am not sure this is a premature optimization. If the OP uses an array with a lot of zeros, then it makes sense to use this condition since a huge part of the code will be skipped. The random dataset is likely not representative of the actual input the OP used. We just do not know yet whether the OP actually has an input where the condition is useful. We can only guess so far. – Jérôme Richard Jun 01 '22 at 20:04
  • 1
    I agree, the ```if np.abs(dx_ij)>1 and if np.abs(dy_ij)>1``` are superfluous but in real data ```if not dx_ij and not dy_ij``` is a common occurrence. Id be interested in hearing what optimisations you think are possible with the rest of the code, I've already changed the indexing from ```[i, j]``` to ```[i][j]``` because I found that to be faster – Nin17 Jun 01 '22 at 20:04
  • Also, I find your optimised range gives exactly the same time as my range with the new indexing, having also removed the ```if np.abs(dx_ij)>1``` and ```if np.abs(dy_ij)>1``` statements, which made the code almost twice as fast on my machine. – Nin17 Jun 01 '22 at 20:06
  • OK, no, actually it is slightly faster, but not by as much as you are finding; I'm getting the range function at 22.5 ms and the optimised one at 22.1 ms. – Nin17 Jun 01 '22 at 20:10
  • Jérôme, I cannot say for sure, only what I tested. Sometimes checking & branching is more expensive than just doing the operations. A clear example is what @Nin17 just pointed out. I know that in general "branchless" code is faster, as it can be preloaded and optimized by the CPU/libraries. For the ultimate test a "real data scenario" would be needed, I guess. – Zaero Divide Jun 01 '22 at 20:11
  • OK, changing the margin of the padding to 1000, it becomes over 2x slower without the continue condition. But actually, thinking about it, I don't see how my real data would contain lots of zeros at the same index in both dx and dy; I'll keep it in mind when I actually get the real data. – Nin17 Jun 01 '22 at 20:18
  • @Nin17 I am confused by your results. Indeed, in your question, the best timings were close to 38-40 ms, so 22 ms seems very good. In fact, a bit too good to be true ;) . Btw, note that timings are a bit noisy and a 0.4 ms gap in this case (<2%) is certainly not statistically significant (at least clearly not on my machine). – Jérôme Richard Jun 01 '22 at 20:19
  • that speed improvement was just from removing the ```if np.abs(dx_ij)>1:``` and ```if np.abs(dy_ij)>1:``` conditions and changing indexing from [i,j] to [i][j] – Nin17 Jun 01 '22 at 20:20
  • @Nin17 Please read carefully the "Notes" section about the probable bug (and the crazy assertion). Bugs often impact performance, though that is generally the least of your problems ;) . – Jérôme Richard Jun 01 '22 at 20:20
  • Yep, I'll look into the indexing, but you're probably right. As for the ```assert j >= 0```, I'm not finding that it makes any difference; maybe this is dependent on the numba version? I'm using 0.55.1. As for my test function, I copied that directly for the purpose of this question, so I haven't tested it much. – Nin17 Jun 01 '22 at 20:27
  • @Nin17 Ok. I am using 0.54.1. The 0.55.0 version appears to update LLVM-Lite, which updates the version of LLVM, which could certainly optimize that better. Such optimizations are brittle. In fact, it is common in C for such optimizations to vary from one compiler to another, so a 5-10% variation between compilers is common. – Jérôme Richard Jun 01 '22 at 20:37
  • 1
    By the way, modern processors use branch prediction to be faster. This means that algorithms having a lot of conditions like your are pretty affected by this behaviour and the dataset can also strongly impact the resulting performance. This is a pretty complex topic. For more information please read [this great post](https://stackoverflow.com/questions/11227809). I agree with ZaeroDivide about testing a "real data scenario" though it can be quite difficult to setup. As for performance, note that SIMD in the key to get high performance but this is hard to make the code SIMD-friendly here. – Jérôme Richard Jun 01 '22 at 20:45

Edit 3: Turns out that removing the if statements makes range faster than enumerate. See Edit 2 below.

Interestingly, on my machine the times get awful for the stack1 and stack2 options, and indeed enumerate seems to be the fastest. Maybe thanks to enumerate Numba understands that it is a looping variable?:

In [1]: %timeit raytrace_range(test_data, dy, dx)
   ...: %timeit raytrace_enumerate(test_data, dy, dx)
   ...: %timeit raytrace_npndenumerate(test_data, dy, dx)
   ...: %timeit raytrace_zip(test_data, dy, dx)
   ...: %timeit raytrace_stack1(np.stack([test_data, dy, dx], axis=1)) # Note this would be the fastest if the arrays were pre-stacked
   ...: %timeit raytrace_stack2(np.stack([test_data, dy, dx], axis=2))
61 ms ± 785 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
53.9 ms ± 998 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.9 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
57.5 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
109 ms ± 478 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
146 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Edit: Using fastmath=True did not shave off much time, only ~3 ms.

Edit 2: Although it is not related to the OP's question, after playing a bit with the functions it turns out that removing the "superfluous"(*) conditional statements makes the code notably faster, around 20% on my machine. The implementation works without them (at least the supplied test still returns True):

(*) The operations seem to work regardless, as those cases are "caught" by the operations below. At least, the provided test vector did not report any issues.

#! Using this it is faster:
# Always the same from here down
# if dx_ij==0 and dy_ij==0:
#     intensity_z[i,j]+=i_ij
#     continue
#Calculating displacement bigger than a pixel
x = np.floor(dx_ij)
i_new=int(i+x)
dx_ij=dx_ij-x
y = np.floor(dy_ij)
j_new=int(j+y)
dy_ij=dy_ij-y
# Calculating sub-pixel displacement


In [2]: %timeit raytrace_range(test_data, dy, dx)
   ...: %timeit raytrace_range2(test_data, dy, dx)
   ...: %timeit raytrace_enumerate(test_data, dy, dx)
64.8 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.9 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
56.1 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Zaero Divide
  • I agree, the if statements testing ```np.abs(dx_ij)>1``` and ```np.abs(dy_ij)>1``` are superfluous - I've already removed a few such if statements from the original function but missed those, but with real data, ```dx_ij==0 and dy_ij==0``` is quite a common occurrence so it'd probably be faster to keep that – Nin17 Jun 01 '22 at 19:59
  • I would suggest trying it; sometimes checking and branching will take more clock cycles than just doing the operation, leaving aside pipelines, caches and other things that are beyond my knowledge.