
So I was testing the speed of two versions of the same function: one that reverses a view of a NumPy array twice, and one that doesn't. The code is as follows:

import numpy as np
from numba import njit

@njit
def min_getter(arr):

    if len(arr) > 1:
        result = np.empty(len(arr), dtype = arr.dtype)
        local_min = arr[0]
        result[0] = local_min

        for i in range(1,len(arr)):
            if arr[i] < local_min:
                local_min = arr[i]
            result[i] = local_min
        return result

    else:
        return arr

@njit
def min_getter_rev1(arr1):

    if len(arr1) > 1:
        arr = arr1[::-1][::-1]
        result = np.empty(len(arr), dtype = arr.dtype)
        local_min = arr[0]
        result[0] = local_min

        for i in range(1,len(arr)):
            if arr[i] < local_min:
                local_min = arr[i]
            result[i] = local_min
        return result

    else:
        return arr1

size = 500000
x = np.arange(size)   
y = np.hstack((x[::-1], x))

y_min = min_getter(y)
yrev_min = min_getter_rev1(y)

Surprisingly, the one with the extra operation runs slightly faster on multiple occasions. I used %timeit around 10 times on both functions and tried different array sizes, and the difference is apparent (at least on my computer). The runtime of min_getter is around:

2.35 ms ± 58.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)(sometimes it is 2.33 and sometimes it is 2.37 but never goes below 2.30)

and the runtime of min_getter_rev1 is around:

2.22 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)(sometimes it is 2.25 and sometimes it is 2.23, but rarely goes above 2.30)


Any ideas on why and how this happened? The speed difference is a 4-6% increase, which can be a big deal in some applications. Understanding the underlying mechanism of the speed-up might help speed up other jitted code.


Note1: I've tried size=5000000 and tested 5-10 times on each function, and the difference is even more apparent. The faster one runs at 23.2 ms ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) and the slower one is at 24.4 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note2: The versions of numpy and numba during the tests are 1.16.5 and 0.45.1; python version is 3.7.4; IPython version is 7.8.0; Python IDE used is spyder. The test results may differ in different versions.

mathguy
    I don't think this has anything to do with reversing, but rather with the fact you are creating an array view inside your function. I get the same results replacing `[::-1][::-1]` with `[:]` (a simpler way to create a view). The difference you get is pretty small anyway – Brenlla Sep 22 '19 at 09:46
  • @Brenlla Thank you for your reply. Is there any general conditions under which creating an array view will speed up the function? – mathguy Sep 22 '19 at 09:50
  • I have no idea, numba is a bit black-boxy to me – Brenlla Sep 22 '19 at 10:49

1 Answer


TL;DR: It's probably just a lucky coincidence that the second code was faster.


Inspecting the generated types reveals there's one important difference:

  • In the first example your arr is typed as array(int32, 1d, C), a C-contiguous array:
min_getter.inspect_types()

min_getter (array(int32, 1d, C),)  <--- THIS IS THE IMPORTANT LINE
--------------------------------------------------------------------------------
# File: <>
# --- LINE 4 --- 
# label 0

@njit

# --- LINE 5 --- 

def min_getter(arr):

[...]
  • In the second example arr is typed as array(int32, 1d, A), an array with no guarantee of contiguity. That's because [::-1] returns an array whose contiguity is unknown at compile time, and once that information is lost it cannot be restored by a second [::-1]:
>>> min_getter_rev1.inspect_types()

[...]

    # --- LINE 18 --- 
    #   arr1 = arg(0, name=arr1)  :: array(int32, 1d, C)
    #   $const0.2 = const(NoneType, None)  :: none
    #   $const0.3 = const(NoneType, None)  :: none
    #   $const0.4 = const(int, -1)  :: Literal[int](-1)
    #   $0.5 = global(slice: <class 'slice'>)  :: Function(<class 'slice'>)
    #   $0.6 = call $0.5($const0.2, $const0.3, $const0.4, func=$0.5, args=(Var($const0.2, <> (18)), Var($const0.3, <> (18)), Var($const0.4, <> (18))), kws=(), vararg=None)  :: (none, none, int64) -> slice<a:b:c>
    #   del $const0.4
    #   del $const0.3
    #   del $const0.2
    #   del $0.5
    #   $0.7 = static_getitem(value=arr1, index=slice(None, None, -1), index_var=$0.6)  :: array(int32, 1d, A)
    #   del arr1
    #   del $0.6
    #   $const0.8 = const(NoneType, None)  :: none
    #   $const0.9 = const(NoneType, None)  :: none
    #   $const0.10 = const(int, -1)  :: Literal[int](-1)
    #   $0.11 = global(slice: <class 'slice'>)  :: Function(<class 'slice'>)
    #   $0.12 = call $0.11($const0.8, $const0.9, $const0.10, func=$0.11, args=(Var($const0.8, <> (18)), Var($const0.9, <> (18)), Var($const0.10, <> (18))), kws=(), vararg=None)  :: (none, none, int64) -> slice<a:b:c>
    #   del $const0.9
    #   del $const0.8
    #   del $const0.10
    #   del $0.11
    #   $0.13 = static_getitem(value=$0.7, index=slice(None, None, -1), index_var=$0.12)  :: array(int32, 1d, A)
    #   del $0.7
    #   del $0.12
    #   arr = $0.13  :: array(int32, 1d, A)  <---- THIS IS THE IMPORTANT LINE
    #   del $0.13

    arr = arr1[::-1][::-1]

[...]

(The rest of the generated code is almost identical)
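The contiguity loss is a limitation of numba's compile-time typing, not a runtime property of the array: NumPy itself recomputes the contiguity flags whenever a view is created, so the twice-reversed view is flagged as C-contiguous again at runtime. A minimal sketch (plain NumPy, no numba required) illustrating the difference:

```python
import numpy as np

y = np.arange(10)

# A single reversal produces a view with a negative stride,
# so it is no longer C-contiguous.
rev = y[::-1]
print(rev.flags['C_CONTIGUOUS'])   # False

# Reversing again restores a positive stride, and NumPy's runtime
# flag computation recognizes the view as C-contiguous once more.
rev2 = y[::-1][::-1]
print(rev2.flags['C_CONTIGUOUS'])  # True

# numba, however, has to assign a type at compile time, when the
# stride signs of a sliced view are unknown - hence the 'A'
# (any-order) layout seen in the inspect_types() output above.
```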

Indexing and iterating should be faster if the array is known to be contiguous. But that's not what we observe in this case; quite the contrary.

So what could be the reason?

Numba itself uses LLVM to compile the jitted code, so there's an actual compiler involved, and compilers can make optimizations. Even though the code shown by inspect_types() is almost identical, the actual LLVM/ASM code (see inspect_llvm() and inspect_asm()) is quite different. So the compiler (or numba) was able to make some optimization in the second case that wasn't possible in the first case, or some optimization that was applied to the first case actually made the code worse.

However, that means we just "got lucky" in the second case. It's probably not something that can be controlled, because it depends on:

  • the types that numba creates based on your source,
  • the source code that numba uses internally to operate on these types,
  • the LLVM generated from these types and the numba source, and
  • the ASM generated from that LLVM.

These are too many moving parts that could apply optimizations (or not).


Fun fact: If you throw away the outer ifs:

import numpy as np
from numba import njit

@njit
def min_getter(arr):
    result = np.empty(len(arr), dtype = arr.dtype)
    local_min = arr[0]
    result[0] = local_min

    for i in range(1,len(arr)):
        if arr[i] < local_min:
            local_min = arr[i]
        result[i] = local_min
    return result

@njit
def min_getter_rev1(arr1):
    arr = arr1[::-1][::-1]
    result = np.empty(len(arr), dtype = arr.dtype)
    local_min = arr[0]
    result[0] = local_min

    for i in range(1,len(arr)):
        if arr[i] < local_min:
            local_min = arr[i]
        result[i] = local_min
    return result

size = 500000
x = np.arange(size)   
y = np.hstack((x[::-1], x))

y_min = min_getter(y)
yrev_min = min_getter_rev1(y)

%timeit min_getter(y)      # 2.29 ms ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit min_getter_rev1(y) # 2.37 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In that case the one without the [::-1][::-1] is faster.

So if you want to make it reliably faster: move the if len(arr) > 1 check outside the function and don't use [::-1][::-1], because in most cases it will make the function slower (and less readable)!
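That restructuring could look like the sketch below. The helper name `_running_min` and the wrapper are made up for illustration; the `@njit` decorator is commented out here only so the sketch runs as plain Python even without numba installed:

```python
import numpy as np
# from numba import njit

# @njit  # in the real code, jit-compile only this hot loop
def _running_min(arr):
    # Same loop as above: no branching on len(arr) inside the jitted code.
    result = np.empty(len(arr), dtype=arr.dtype)
    local_min = arr[0]
    result[0] = local_min
    for i in range(1, len(arr)):
        if arr[i] < local_min:
            local_min = arr[i]
        result[i] = local_min
    return result

def min_getter(arr):
    # Keep the trivial-length check in plain Python, outside the jitted
    # function, so the compiled code always takes the hot path.
    if len(arr) > 1:
        return _running_min(arr)
    return arr

print(min_getter(np.array([3, 1, 4, 1, 5])))  # [3 1 1 1 1]
```

The result is the running minimum, so it can be cross-checked against np.minimum.accumulate.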

MSeifert
  • Thanks for your very detailed answer. I was thinking that making some arrays C-contiguous might make some code faster, but after seeing the conclusion about the "luck" involved, I'm not sure what to do anymore. – mathguy Sep 22 '19 at 21:58
  • @mathguy Don't worry too much. In general a normal approach (without hacks like `[::-1][::-1]`) will be fast enough and the difference between contiguous arrays and A-contiguous arrays is small enough to be negligible. Even 5-10% isn't really worth bothering. If you really need to squeeze out everything don't use numba - it's just a code-generation tool which makes writing "fast enough code" really, really simple. Typically you can get much more significant speed-ups (1.5-2 times faster) by using C directly - but that requires a lot more skill and effort. – MSeifert Sep 22 '19 at 22:18
  • Thanks for your recommendation. Yes, I'd like to squeeze out every last bit of performance (the actual code I want to speed up is here: https://stackoverflow.com/questions/58046739/amortized-o1-rolling-minimum-implemented-in-python-numba-numpy? ). But as of now I don't know C, and even if I did, I don't know how to call a C function from a Python environment. I guess I'll have to stick to `numba` and `numpy`, or maybe `tensorflow` for larger arrays, for quite a while. – mathguy Sep 22 '19 at 22:25