
Let's say I have two functions

def my_sub1(a):
    return a + 2

def my_main(a):
    a += 1
    b = my_sub1(a)
    return b

and I want to make them faster using a just-in-time compiler like Numba. Is this going to be slower than if I refactor everything into one function

def my_main(a):
    a += 1
    b = a + 2
    return b

because Numba can do deeper optimizations in the second case? Of course, my real functions are quite a bit more complex.

Also, this whole situation gets more difficult if the my_sub1 function gets called more than once: refactoring (and maintaining the result) would become a drag. How does Numba solve this issue?

Make42
  • What is the type of `a` in practice? What is your version of Numba? I am unable to reproduce the problem with Numba 0.53 with `a` an array of 1M `float64` values: both take exactly the same time. Please provide a minimal reproducible example. – Jérôme Richard Apr 24 '21 at 11:23
  • @JérômeRichard: My arguments (not just one `a`) are numpy arrays (the data that gets processed) and single values, including strings, integers and floats for configuration. But maybe there is a misunderstanding: it is not that I did an experiment and one case was faster than the other. Instead, I am asking about how Numba works, specifically how it works if I had the first case (with two functions). I want to know if the second case is faster in general and by principle (not specifically and by experiment). – Make42 Apr 24 '21 at 13:06

1 Answer


Tl;dr: Numba is able to inline other Numba functions and it performs relatively advanced inter-procedural optimizations only when using native types (both functions are equally fast in this case), but not with Numpy arrays.


You can analyze the resulting assembly code produced by Numba to check how the two functions are optimized. Here is an example with an integer:

import numba as nb

@nb.njit('int64(int64)')
def my_sub1(a):
    return a + 2

@nb.njit('int64(int64)')
def my_main(a):
    a += 1
    b = my_sub1(a)
    return b

with open('my_sub1.asm', 'w') as f:
    f.write(list(my_sub1.inspect_asm().values())[0])
with open('my_main.asm', 'w') as f:
    f.write(list(my_main.inspect_asm().values())[0])

This produces two assembly files. If you compare the two files, you will see that the only actual difference (besides the different names) is that the first does addq $2, %rdx while the second does addq $3, %rdx. This means that Numba succeeded in inlining the call to my_sub1 into my_main and merging the two additions. Here is the important part of the assembly code:

_ZN8__main__12my_sub1$2413Ex:
    addq    $2, %rdx
    movq    %rdx, (%rdi)
    xorl    %eax, %eax
    retq

_ZN8__main__12my_main$2414Ex:
    addq    $3, %rdx
    movq    %rdx, (%rdi)
    xorl    %eax, %eax
    retq

With 64-bit floats, the result is the same as long as you use fastmath=True: floating-point addition is not associative, so without this flag Numba (LLVM) is not allowed to merge the two additions into one.
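
For example, here is a minimal float sketch (the _f-suffixed names are hypothetical, chosen to avoid clashing with the integer versions above); without fastmath=True the two additions stay separate in the generated assembly:

import numba as nb

# fastmath=True allows LLVM to reassociate floating-point operations,
# so a + 1.0 followed by a + 2.0 can be merged into a single a + 3.0.
@nb.njit('float64(float64)', fastmath=True)
def my_sub1_f(a):
    return a + 2.0

@nb.njit('float64(float64)', fastmath=True)
def my_main_f(a):
    a += 1.0
    return my_sub1_f(a)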

Regarding Numpy arrays, the generated code gets huge and it is very difficult to compare the two versions. However, the my_sub1 function does not seem to be inlined anymore and Numba does not seem able to merge the two Numpy computations (two distinct vectorized loops, one per array addition, are present in the generated code). Note that this is similar to what many C/C++ compilers do. As a result, it is probably better to inline functions yourself in the performance-critical parts of your code, as sketched below.
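
For illustration, a manually-inlined array version could look like the following (a hypothetical sketch: the single explicit loop does one pass over the data and one output allocation instead of two vectorized passes with an intermediate temporary array):

import numpy as np
import numba as nb

# Hand-inlined version for arrays: the body of my_sub1 is merged into
# the loop, avoiding a second pass and an intermediate temporary array.
@nb.njit('float64[:](float64[:])')
def my_main_arr(a):
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = (a[i] + 1.0) + 2.0  # a += 1 followed by my_sub1
    return out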

Jérôme Richard
  • You could also use `inline='always'` to force inlining of functions at the Numba level. For functions working on arrays it is usually important to enable loop fusion, allocation hoisting and other optimizations too. This can be accomplished using `parallel=True` and `nb.parfor.sequential_parfor_lowering = True` to get optimized single-threaded code. Example: https://stackoverflow.com/a/58381610/4045774 But for more complicated examples doing it manually is preferable. – max9111 Apr 26 '21 at 10:32
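
For reference, a minimal sketch of the forced inlining mentioned in the comment above (inline='always' makes Numba inline the callee at the Numba IR level, before the LLVM optimization passes run):

import numba as nb

# inline='always' asks Numba to inline my_sub1 into its callers at the
# IR level, giving LLVM a single function body to optimize.
@nb.njit(inline='always')
def my_sub1(a):
    return a + 2

@nb.njit
def my_main(a):
    a += 1
    return my_sub1(a)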