
I'm looking for the most time-efficient way of computing the squared absolute value of a complex ndarray in Python.

import numpy as np
arr = np.empty((8, 4000), dtype="complex128")  # typical size

I have tried these options:

import numba

# numpy implementation
def abs2_numpy(x):
    return x.real**2 + x.imag**2

# numba implementation
@numba.vectorize([numba.float64(numba.complex128)])
def abs2_numba(x):
    return x.real**2 + x.imag**2

It turns out that the numba implementation is roughly 4x faster than numpy, but I would like to know if there exists a faster method.
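For reference, the comparison was made with %timeit along these lines (a sketch; x is a fresh name holding random data instead of np.empty, and the exact numbers depend on the machine):

x = np.random.rand(8, 4000) + 1j * np.random.rand(8, 4000)  # filled data, same shape/dtype as arr
%timeit abs2_numpy(x)
%timeit abs2_numba(x)  # roughly 4x faster on my machine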

I have read this question, which mentions several methods, but that post is oriented toward memory efficiency, which is not a constraint in my case.

  • Have you tried to replace `**` with `np.square` to see if there is a little improvement? – Hamzah Mar 21 '23 at 15:58
  • Have you tried also using `np.abs(x, out=x)` instead of the computation of each part of x? – Hamzah Mar 21 '23 at 16:00
  • @Hamzah Thanks for the comment. I tried with `np.square` but it makes no difference. On the other hand, `np.abs(x, out=x)` is significantly slower, as it involves computing a square root. – Nahuel Almeira Mar 21 '23 at 16:10
  • I thought `(x.real+x.imag)*(x.real-x.imag)` could perhaps be faster (only one multiplication) but it's not. – Alain T. Mar 21 '23 at 16:33
  • @AlainT. Thanks for trying, but note that the operation you propose is not mathematically correct, as it gives `x.real**2 - x.imag**2`. The correct version of your proposal would be to compute `(x * x.conjugate()).real`. I have tried this option but it takes roughly the same as the python implementation I wrote in the question. – Nahuel Almeira Mar 21 '23 at 16:44
  • Right, I thought x.imag**2 would negate the resulting negative from i^2 but it's no longer imaginary at that point. My bad. – Alain T. Mar 21 '23 at 16:51
  • `arr*arr.conjugate()` is another option, though it times a bit slower than your `abs2_numpy` – hpaulj Mar 21 '23 at 16:58
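For reference, the alternatives discussed in the comments above can be written out as follows (a sketch assuming the imports from the question; as noted in the comments, none of them beat the Numba version):

def abs2_square(x):
    # np.square instead of ** makes no measurable difference
    return np.square(x.real) + np.square(x.imag)

def abs2_conj(x):
    # mathematically equivalent, roughly as fast as abs2_numpy
    return (x * x.conjugate()).real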

1 Answer


The execution time of the provided function is hard to beat. The best way to find out whether a better implementation is possible is to look at the generated assembly code, since blindly writing many alternative functions is a waste of time if the generated assembly is already close to optimal. In fact, that is the case here: the generated assembly code is very good for this input layout.
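If you want to check this yourself, the dispatcher of an @numba.njit function exposes the generated assembly through inspect_asm() (a sketch using an njit variant, since that is where this method is available; abs2_njit is an illustrative name):

import numba
import numpy as np

arr = np.empty((8, 4000), dtype="complex128")

@numba.njit
def abs2_njit(x):
    # equivalent to x.real**2 + x.imag**2
    return x.real * x.real + x.imag * x.imag

abs2_njit(arr)  # trigger compilation for this input type
# inspect_asm() maps each compiled signature to its assembly listing
print(next(iter(abs2_njit.inspect_asm().values())))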

Numba internally generates several ways to compute the array and chooses the best one at runtime based on the input. When the input array is contiguous in memory, the best implementation uses the following assembly hot loop:

.LBB0_55:
    vmovupd -192(%r9,%rbp,2), %ymm0
    vmovupd -160(%r9,%rbp,2), %ymm1
    vmovupd -128(%r9,%rbp,2), %ymm2
    vmovupd -96(%r9,%rbp,2), %ymm3
    vmovupd -64(%r9,%rbp,2), %ymm4
    vmovupd -32(%r9,%rbp,2), %ymm5
    vmovupd (%r9,%rbp,2), %ymm6
    vmulpd  %ymm1, %ymm1, %ymm1
    vmulpd  %ymm0, %ymm0, %ymm0
    vhaddpd %ymm1, %ymm0, %ymm0
    vmovupd 32(%r9,%rbp,2), %ymm1
    vmulpd  %ymm3, %ymm3, %ymm3
    vmulpd  %ymm2, %ymm2, %ymm2
    vhaddpd %ymm3, %ymm2, %ymm2
    vmulpd  %ymm5, %ymm5, %ymm3
    vmulpd  %ymm4, %ymm4, %ymm4
    vhaddpd %ymm3, %ymm4, %ymm3
    vmulpd  %ymm1, %ymm1, %ymm1
    vmulpd  %ymm6, %ymm6, %ymm4
    vhaddpd %ymm1, %ymm4, %ymm1
    vpermpd $216, %ymm0, %ymm0
    vpermpd $216, %ymm2, %ymm2
    vpermpd $216, %ymm3, %ymm3
    vpermpd $216, %ymm1, %ymm1
    vmovupd %ymm0, (%r11,%rbp)
    vmovupd %ymm2, 32(%r11,%rbp)
    vmovupd %ymm3, 64(%r11,%rbp)
    vmovupd %ymm1, 96(%r11,%rbp)
    subq    $-128, %rbp
    addq    $-16, %rdx
    jne .LBB0_55

This loop is pretty efficient. Indeed, it is vectorized using the AVX2 SIMD instruction set (the best one available for this case on my machine). This can be seen from the opcode prefixes/suffixes and the operands: the v prefix is for AVX, the pd suffix is for packed double-precision operations, and the ymm operands are 256-bit-wide AVX registers. The loop is also unrolled, so the per-iteration overhead is negligible. I hardly believe any alternative Numba function could generate a better hot loop. In fact, I do not expect any auto-vectorized native code to be significantly faster either.

While the loop is very good, I do not think it is optimal. Indeed, vhaddpd could certainly be replaced by a vertical vaddpd combined with swizzle instructions (like vshufpd, vunpckhpd and vunpcklpd, possibly together with a vpermpd). That being said, the speed-up would likely be small. Indeed, the input array takes about 500 KiB and the output about 250 KiB, so the operation is mostly bound by the L2/L3 caches, which are slower than the L1 cache and may be the bottleneck of a more optimized code. In fact, on my machine, a function returning just x.real takes the same time! Not to mention that writing such code in assembly, or using intrinsics in a native language (like C/C++), results in non-portable code and requires being highly skilled (in both C/C++ and assembly, with a very good knowledge of how x86-64 processors operate). Put shortly, I think this loop is the best you can get for sequential Python code.
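As a quick way to verify this memory-bound behaviour on your own machine, you can compare the full computation against a function that only moves the data (a sketch reusing arr and abs2_numba from the question; copy_real is just an illustrative name):

@numba.vectorize([numba.float64(numba.complex128)])
def copy_real(x):
    # same memory traffic as abs2_numba, but almost no arithmetic
    return x.real

# If both report about the same time, the kernel is memory-bound:
%timeit abs2_numba(arr)
%timeit copy_real(arr)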

Regarding your target platform, using multiple threads can be faster. However, creating threads, sharing the work and waiting for the results take a significant time at this granularity, so it can actually be slower than the computation itself on some machines (especially computing servers with many cores). On my machine the overhead is low (a few microseconds), so using multiple threads is slightly faster. This is far from great: the rather disappointing speed-up means most of your cores are essentially wasted. On top of that, Numba fails to auto-vectorize plain loops using SIMD instructions here (and vectorize seems not to support the parallel=True flag yet). We can speed up the function a bit more by preallocating the output array once, since temporary arrays are a bit slow to create (and to fill the first time). Here is the resulting code:

@numba.njit('(complex128[:,::1],float64[:,::1])', parallel=True)
def abs2_numba(x, out):
    # prange parallelizes over the rows; each row is processed by a plain inner loop
    for i in numba.prange(x.shape[0]):
        tin = x[i, :]
        tout = out[i, :]
        for j in range(x.shape[1]):
            v = tin[j]
            real = v.real
            imag = v.imag
            tout[j] = real * real + imag * imag

out = np.empty(arr.shape)  # preallocated once and reused across calls
%timeit -n 10_000 abs2_numba(arr, out)

This takes 9.8 µs on my 6-core i5-9600 machine, while the initial code takes 12.9 µs. Most of the time is spent in the parallel overhead and in the fact that this code is not as well vectorized as the sequential one.

I do not advise you to use this parallel implementation in your case unless every microsecond matters and you do not plan to run it on a machine with many cores. Instead, optimizing the code that calls this function is certainly the best thing to do in your case.

Jérôme Richard