The execution time of the provided function is very hard to beat. In such a case, the best way to find out whether a better implementation is possible is to look at the generated assembly code, since blindly writing many alternative functions is a waste of time when the generated assembly is already close to optimal. In fact, that is the case here: the generated assembly code is very good for this input layout.
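For context, here is a minimal sketch of the kind of sequential Numba function assumed throughout this answer (the exact function from the question is not shown, so the name abs2 and the eager signature are illustrative), together with the inspect_asm call that dumps the assembly Numba generated for it:

import numba
import numpy as np

# Illustrative sketch: squared magnitude |z|^2 = re*re + im*im, element-wise.
@numba.njit('float64[:,::1](complex128[:,::1])')
def abs2(x):
    out = np.empty(x.shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            v = x[i, j]
            out[i, j] = v.real * v.real + v.imag * v.imag
    return out

# Print the assembly of every compiled specialization.
for sig, asm in abs2.inspect_asm().items():
    print(sig)
    print(asm)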
Numba internally generates multiple ways to compute the array and chooses the best one at runtime based on the input. When the input array is contiguous in memory, the best implementation uses the following assembly hot loop:
.LBB0_55:
vmovupd -192(%r9,%rbp,2), %ymm0
vmovupd -160(%r9,%rbp,2), %ymm1
vmovupd -128(%r9,%rbp,2), %ymm2
vmovupd -96(%r9,%rbp,2), %ymm3
vmovupd -64(%r9,%rbp,2), %ymm4
vmovupd -32(%r9,%rbp,2), %ymm5
vmovupd (%r9,%rbp,2), %ymm6
vmulpd %ymm1, %ymm1, %ymm1
vmulpd %ymm0, %ymm0, %ymm0
vhaddpd %ymm1, %ymm0, %ymm0
vmovupd 32(%r9,%rbp,2), %ymm1
vmulpd %ymm3, %ymm3, %ymm3
vmulpd %ymm2, %ymm2, %ymm2
vhaddpd %ymm3, %ymm2, %ymm2
vmulpd %ymm5, %ymm5, %ymm3
vmulpd %ymm4, %ymm4, %ymm4
vhaddpd %ymm3, %ymm4, %ymm3
vmulpd %ymm1, %ymm1, %ymm1
vmulpd %ymm6, %ymm6, %ymm4
vhaddpd %ymm1, %ymm4, %ymm1
vpermpd $216, %ymm0, %ymm0
vpermpd $216, %ymm2, %ymm2
vpermpd $216, %ymm3, %ymm3
vpermpd $216, %ymm1, %ymm1
vmovupd %ymm0, (%r11,%rbp)
vmovupd %ymm2, 32(%r11,%rbp)
vmovupd %ymm3, 64(%r11,%rbp)
vmovupd %ymm1, 96(%r11,%rbp)
subq $-128, %rbp
addq $-16, %rdx
jne .LBB0_55
This loop is pretty efficient. Indeed, it is vectorized using the AVX2 SIMD instruction set (the best available for this case on my machine). This can be seen by looking at the opcode prefixes/suffixes and the operands: the v prefix is for AVX, the pd suffix is for packed double-precision operations, and the ymm operands are 256-bit wide AVX registers. The loop is also unrolled, so the per-iteration overhead is negligible. I find it hard to believe that any alternative Numba function could generate a better hot loop. In fact, I do not expect any auto-vectorized native code to be significantly faster either.
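Rather than reading the full dump, a quick way to check these properties is to scan the assembly text for the relevant register names and mnemonics (a rough heuristic, assuming the abs2 sketch above):

# Grab the assembly of the first (here, only) compiled specialization.
asm = next(iter(abs2.inspect_asm().values()))
print('uses 256-bit ymm registers:', 'ymm' in asm)  # AVX/AVX2 vectorization
print('uses vhaddpd:', 'vhaddpd' in asm)            # horizontal adds present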
While the loop is very good, I do not think it is optimal. Indeed, vhaddpd can certainly be replaced by a vertical vaddpd combined with swizzle instructions (like vshufpd, vunpckhpd and vunpcklpd, as well as possibly a vpermpd). That being said, the speed-up would likely be small. Indeed, the input takes 64 KiB and the output takes 32 KiB, so the operation will almost certainly be done in the L2 cache, which is generally slower and may be the bottleneck of a more optimized code. In fact, on my machine, a function returning just x.real takes the same time! Not to mention that writing such code in assembly, or with intrinsics in a native language (like C/C++), results in non-portable code and requires being highly skilled (in both C/C++ and assembly, with a very good knowledge of how x86-64 processors operate). Put shortly, I think this loop is the best you can get from sequential Python code.
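To reproduce that x.real timing check, here is a hedged sketch (the name real_only is illustrative) of a Numba function that performs the same loads and stores but no arithmetic; if it is as slow as the full computation, the kernel is bound by memory (L2) bandwidth rather than by the arithmetic:

@numba.njit('float64[:,::1](complex128[:,::1])')
def real_only(x):
    # Same memory traffic as abs2, but no multiplications:
    # equal timings indicate the computation is memory-bound.
    out = np.empty(x.shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = x[i, j].real
    return out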
Regarding your target platform, using multiple threads can be faster. Creating threads, sharing the work and waiting for them takes a significant amount of time at this granularity, so it can actually be slower than the computation itself on some machines (especially computing servers with many cores). On my machine the overhead is low (a few microseconds), so using multiple threads is slightly faster. This is far from great: the disappointing speed-up means most of the cores are effectively wasted. On top of that, Numba fails to auto-vectorize the code using SIMD instructions with plain loops here (and vectorize does not seem to support the parallel=True flag yet). We can speed the function up a bit more by preallocating the output array once, since temporary arrays are a bit slow to create (and to fill the first time). Here is the resulting code:
import numba
import numpy as np

@numba.njit('(complex128[:,::1],float64[:,::1])', parallel=True)
def abs2_numba(x, out):
    # Rows are distributed across threads by prange.
    for i in numba.prange(x.shape[0]):
        tin = x[i, :]
        tout = out[i, :]
        for j in range(x.shape[1]):
            v = tin[j]
            real = v.real
            imag = v.imag
            tout[j] = real * real + imag * imag

out = np.empty(x.shape)  # preallocated once, reused across calls
%timeit -n 10_000 abs2_numba(x, out)
This takes 9.8 µs on my 6-core i5-9600 machine, while the initial code takes 12.9 µs. Most of the time is spent in the parallel overhead and in the fact that the code is not as well vectorized as the sequential version.
I do not advise you to use this parallel implementation in your case, unless every microsecond matters and you do not plan to run it on a machine with many cores. Instead, optimizing the code that calls this function is certainly the best thing to do in your case.