
I have the following snippet of code where I've used Numba to speed things up:

import numpy as np
from numpy import linalg
from numba import jit

Sigma = np.array([
                  [1, 1, 0.5, 0.25],
                  [1, 2.5, 1, 0.5],
                  [0.5, 1, 0.5, 0.25],
                  [0.25, 0.5, 0.25, 0.25]
])
Z = np.array([0.111, 0.00658])

@jit(nopython=True)
def mean(Sigma, Z):
  return np.dot(np.dot(Sigma[0:2, 2:4], linalg.inv(Sigma[2:4, 2:4])), Z)

print(mean(Sigma, Z))

However, Numba is complaining:

NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 2d, A), array(float64, 2d, F))
  return np.dot(np.dot(Sigma[0:2, 2:4], linalg.inv(Sigma[2:4, 2:4])), Z)

If I'm not mistaken (after reading this), the contiguous structure of the NumPy arrays is broken by slicing sub-matrices out of Sigma (e.g., "Sigma[0:2, 2:4]"). Is this correct? If so, is there any way to resolve this warning? I believe resolving it would help speed up my code, which is my main goal. Thanks.
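For reference, checking the flags of the sub-matrices in plain NumPy (outside the jitted function) seems to confirm that the slices are strided, non-contiguous views:

print(Sigma[0:2, 2:4].flags['C_CONTIGUOUS'])  # False: the slice is a strided view of Sigma
print(Sigma[2:4, 2:4].flags['C_CONTIGUOUS'])  # False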

Resu
  • You can copy the array with `np.ascontiguousarray` (see the sketch after these comments). But for small arrays this is likely not beneficial. – max9111 Mar 02 '22 at 11:20
  • Are the arrays small like that in practice? – Jérôme Richard Mar 02 '22 at 17:44
  • Yes, it is 2x2 in my code as well (i.e., always small). However, the function (i.e., mean(...)) is called many times in my code. – Resu Mar 03 '22 at 11:27
  • If you are only changing the values and the arrays always have the same shape, it would be possible to solve this problem analytically and then hardcode everything. This would be significantly faster, but of course not possible if it should work for other array shapes. – max9111 Mar 03 '22 at 16:24
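A minimal sketch of the copy approach suggested by max9111 above (the `mean_contig` name is just for illustration; for arrays this small the extra copies may well cost more than they save):

@jit(nopython=True)
def mean_contig(Sigma, Z):
    # Copy the non-contiguous slices into contiguous buffers before calling np.dot
    a = np.ascontiguousarray(Sigma[0:2, 2:4])
    b = np.ascontiguousarray(linalg.inv(Sigma[2:4, 2:4]))
    return np.dot(np.dot(a, b), Z)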

1 Answer


You get this warning because dot and inv are optimized for contiguous arrays. However, given the small input size, this is not a huge problem. Still, you can at least specify that the input arrays are contiguous by using the signature 'float64[:](float64[:,::1], float64[::1])' in the @jit(...) decorator. This also causes the function to be compiled eagerly.
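Applied to the original function, the decorator would look like this (a sketch; the body is the same computation, written with np.linalg.inv directly):

import numpy as np
import numba as nb

# Eager compilation: the argument types (C-contiguous float64 arrays) are declared up front
@nb.njit('float64[:](float64[:,::1], float64[::1])')
def mean(Sigma, Z):
    return np.dot(np.dot(Sigma[0:2, 2:4], np.linalg.inv(Sigma[2:4, 2:4])), Z)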

The biggest performance issue in this function is the creation of a few temporary arrays and the call to linalg.inv, which is not designed to be fast for very small matrices. For a 2x2 block, the inverse can be computed directly with a simple expression based on its determinant.

Here is the resulting code:

import numpy as np
import numba as nb

@nb.njit('float64[:](float64[:,::1], float64[::1])')
def fast_mean(Sigma, Z):
    # Compute the inverse matrix
    mat_a = Sigma[2, 2]
    mat_b = Sigma[2, 3]
    mat_c = Sigma[3, 2]
    mat_d = Sigma[3, 3]
    invDet = 1.0 / (mat_a*mat_d - mat_b*mat_c)
    inv_a = invDet * mat_d
    inv_b = -invDet * mat_b
    inv_c = -invDet * mat_c
    inv_d = invDet * mat_a

    # Compute the matrix multiplication
    mat_a = Sigma[0, 2]
    mat_b = Sigma[0, 3]
    mat_c = Sigma[1, 2]
    mat_d = Sigma[1, 3]
    tmp_a = mat_a*inv_a + mat_b*inv_c
    tmp_b = mat_a*inv_b + mat_b*inv_d
    tmp_c = mat_c*inv_a + mat_d*inv_c
    tmp_d = mat_c*inv_b + mat_d*inv_d

    # Final dot product
    z0, z1 = Z
    result = np.empty(2, dtype=np.float64)
    result[0] = tmp_a*z0 + tmp_b*z1
    result[1] = tmp_c*z0 + tmp_d*z1
    return result

This is about 3 times faster on my machine. Note that more than 60% of the time is spent in the overhead of calling the Numba function from Python and creating the temporary output array. Thus, it is probably wise to use Numba in the caller functions too, so as to remove this overhead.
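For example, something along these lines (a sketch; mean_over_rows and the (n, 2) layout of Zs are purely illustrative assumptions, and it reuses fast_mean from above):

@nb.njit
def mean_over_rows(Sigma, Zs):
    # Hypothetical caller: Zs holds one observation per row; calling
    # fast_mean from compiled code avoids the per-call Python overhead
    out = np.empty_like(Zs)
    for i in range(Zs.shape[0]):
        out[i, :] = fast_mean(Sigma, Zs[i])
    return out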

You can also pass the result array as a parameter so as to avoid creating the output array, which is quite expensive, as pointed out by @max9111. This is only useful if you can preallocate the output buffer in the caller functions (once, if possible). With this change, the function is nearly 6 times faster.
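A sketch of such a variant (the fast_mean_out name is hypothetical; out must be a preallocated float64 array of size 2):

@nb.njit('void(float64[:,::1], float64[::1], float64[::1])')
def fast_mean_out(Sigma, Z, out):
    # Same computation as fast_mean, but writing into a caller-provided buffer
    a, b = Sigma[2, 2], Sigma[2, 3]
    c, d = Sigma[3, 2], Sigma[3, 3]
    invDet = 1.0 / (a*d - b*c)
    inv_a, inv_b = invDet*d, -invDet*b
    inv_c, inv_d = -invDet*c, invDet*a
    a, b = Sigma[0, 2], Sigma[0, 3]
    c, d = Sigma[1, 2], Sigma[1, 3]
    z0, z1 = Z
    out[0] = (a*inv_a + b*inv_c)*z0 + (a*inv_b + b*inv_d)*z1
    out[1] = (c*inv_a + d*inv_c)*z0 + (c*inv_b + d*inv_d)*z1

# Usage: allocate the output buffer once in the caller and reuse it
out = np.empty(2, dtype=np.float64)
fast_mean_out(Sigma, Z, out)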

Jérôme Richard
  • It may also make sense to pass the output array to the function to avoid heap allocation of tiny arrays. Additionally, setting inline="always" could make sense here to avoid a function call completely, if called from another Numba function. Division-by-zero checking could also be avoided by using error_model="numpy". fastmath=True may also be worth a try. – max9111 Mar 03 '22 at 21:10
  • Good catch for the output! The time improves from about 900 ns to about 500 ns. The rest does not have a significant impact and could cause issues (fastmath with NaN values and possibly numerical accuracy, while the division-by-zero check is cheap and relatively useful here to prevent issues). As for the inlining, LLVM should do that already. inline="always" forces the inlining at the Numba level, which is not critical here, but using Numba in the caller function is. Thank you. – Jérôme Richard Mar 03 '22 at 21:34
  • LLVM might do that. When calling this from another Numba function (I haven't checked thoroughly for dead-code elimination, to be honest), I get, for 1000 calls, 1.2 ms for the previous version and 35 microseconds for the optimized version. – max9111 Mar 03 '22 at 22:01
  • Respectively with and without the inlining? This seems really huge just for the removal of a function call. A function call in such a case should take no more than a few ns on my machine, and not much more compared to the current 500 ns (though the computation itself should take no more than dozens of ns). Constant propagation and other similar optimizations can be applied, but none of them can explain such a difference. Note that LLVM can remove function calls (or any piece of code) that do not have any visible effect. Common sub-expression elimination can also cause some benchmark bias. – Jérôme Richard Mar 03 '22 at 22:16
  • Benchmarking: https://pastebin.com/BApP7VDc 1) The function call overhead from Python definitely plays the most important role. 2) Memory allocation is more costly than the whole calculation. (Maybe stack-allocated arrays could also be used, but they are not officially supported right now.) – max9111 Mar 03 '22 at 23:06
  • @max9111 `Sigma` and `Z` do not change in the loop, so the compiler can optimize it. LLVM is designed to optimize such a case and should compute the determinant only once (as well as `tmp_a`, `tmp_b`, etc.). All the values should be fetched and stored in registers only once for the 1000 iterations. In fact, the compiler could almost completely remove the loop since `i` is unused, but because fastmath is not used it needs to perform the `out[0]+out[1]` sum 1000 times (since floating-point math is not associative). So the benchmark certainly does not measure what you want. – Jérôme Richard Mar 04 '22 at 12:32
  • @max9111 In practice, compilers tend not to optimize allocations because this is quite difficult to do. LLVM can do that, but it often does not when used in Numba via llvmlite, for some unknown reason. Thus, the measured time should be that of 1000 allocations and sums (which is not so fast due to the dependency chain and the latency of the FMA unit). – Jérôme Richard Mar 04 '22 at 12:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242593/discussion-between-max9111-and-jerome-richard). – max9111 Mar 04 '22 at 12:39