
For a real-time DSP application I need to compute the absolute values of a complex-valued vector.

The straightforward implementation would look like this:

#include <complex>

void computeAbsolute (std::complex<float>* complexSourceVec,
                      float* realValuedDestinationVec,
                      int vecLength)
{
    for (int i = 0; i < vecLength; ++i)
        realValuedDestinationVec[i] = std::abs (complexSourceVec[i]);
}

I want to replace this implementation with an AVX2-optimized version based on AVX2 intrinsics. What would be the most efficient way to implement it?

Note: The source data is handed to me by an API I have no access to, so there is no chance to change the layout of the complex input vector for better efficiency.

PluginPenguin
  • abs(complex) is the magnitude, the same as a 2d vector length, `sqrt(real^2 + imag^2)` (https://www.mathwarehouse.com/algebra/complex-number/absolute-value-complex-number.php). If you can lay out your data with separate arrays of `real[i]` and `imag[i]`, it will SIMD more efficiently without any shuffling, for all the same reasons as with XY direction vectors. – Peter Cordes Dec 03 '18 at 10:02
  • If you can use abs_squared, omitting the square root is a big gain in efficiency. (sqrt is very slow compared to one multiply + 1 FMA.) Alternatively, `rsqrtps` and a Newton iteration instead of `sqrt` will help throughput on most CPUs if this is all you're doing in a loop ([Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision](https://stackoverflow.com/q/31555260)), but if you can do this on the fly as part of doing something with the abs result, that would be another way to avoid a throughput bottleneck on the FP sqrt unit. – Peter Cordes Dec 03 '18 at 10:04
  • Thank you, I know how to compute it and I also know that a different data layout would help, however I get the source data from an API I cannot modify, so I need to stick to the shuffled layout of std::complex – PluginPenguin Dec 03 '18 at 10:06
  • If 4% error would be ok for you, you can also check out the [alpha-max+beta-min algorithm](https://en.wikipedia.org/wiki/Alpha_max_plus_beta_min_algorithm) -- this can also be vectorized quite well with SIMD. If you need to shuffle your input anyway, however, you might not gain a lot from it. – chtz Dec 03 '18 at 13:22
  • @chtz if you don't know which max and min are a priori, alpha-max beta-min requires the same number of operations as computing the length with rsqrt and no refinement, which is much more accurate. The latency is slightly better, but the throughput is identical on most cores, which is what matters in a vector context. – Stephen Canon Dec 29 '18 at 21:01
  • @StephenCanon You are right, and alpha-max-beta-min usually even requires two additional `and` operations (which are on different ports, though, if I see it correctly). So it is probably never worth implementing on modern architectures, unless you have problems with under-/overflow (or, as you said, you already know max and min). – chtz Dec 30 '18 at 01:14
  • From a deleted link-only answer: https://github.com/numpy/numpy/blob/14756955d9eafb823fb13eb718943756133e5936/numpy/core/src/umath/simd.inc.src#L1662 is an implementation for NumPy for CFLOAT and CDOUBLE which is reported to handle all corner cases of NaN and Inf. – Peter Cordes Jan 24 '20 at 00:40
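
A minimal, untested sketch of the `rsqrtps` + Newton-iteration idea from the comments above (the helper name is mine; it assumes the squared magnitude is already in a register and that exact zeros don't occur, since `rsqrt(0)` is inf and would turn into a NaN here):

#include <immintrin.h>

static inline __m256 sqrt_via_rsqrt (__m256 x)
{
    __m256 y = _mm256_rsqrt_ps (x);                    // ~12-bit approximation of 1/sqrt(x)
    // one Newton-Raphson step: y = 0.5 * y * (3 - x * y * y)
    y = _mm256_mul_ps (_mm256_mul_ps (_mm256_set1_ps (0.5f), y),
                       _mm256_sub_ps (_mm256_set1_ps (3.0f),
                                      _mm256_mul_ps (x, _mm256_mul_ps (y, y))));
    return _mm256_mul_ps (x, y);                       // x * 1/sqrt(x) == sqrt(x)
}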

2 Answers


Inspired by Dan M.'s answer, I first implemented his version with some tweaks:

First I changed it to use the wider 256-bit registers, then I marked the temporary `re` and `im` arrays with `__attribute__((aligned (32)))` to be able to use aligned loads:

#include <complex>
#include <immintrin.h>

// Note: processes 8 complex values per iteration, so length is assumed to be a multiple of 8
void computeAbsolute1 (const std::complex<float>* cplxIn, float* absOut, const int length)
{
    for (int i = 0; i < length; i += 8)
    {
        float re[8] __attribute__((aligned (32))) = {cplxIn[i].real(), cplxIn[i + 1].real(), cplxIn[i + 2].real(), cplxIn[i + 3].real(), cplxIn[i + 4].real(), cplxIn[i + 5].real(), cplxIn[i + 6].real(), cplxIn[i + 7].real()};
        float im[8] __attribute__((aligned (32))) = {cplxIn[i].imag(), cplxIn[i + 1].imag(), cplxIn[i + 2].imag(), cplxIn[i + 3].imag(), cplxIn[i + 4].imag(), cplxIn[i + 5].imag(), cplxIn[i + 6].imag(), cplxIn[i + 7].imag()};
        __m256 x4 = _mm256_load_ps (re);
        __m256 y4 = _mm256_load_ps (im);
        __m256 b4 = _mm256_sqrt_ps (_mm256_add_ps (_mm256_mul_ps (x4,x4), _mm256_mul_ps (y4,y4)));
        _mm256_storeu_ps (absOut + i, b4);
    }
}

However, manually shuffling the values this way seemed like a task that could be sped up somehow. This is the solution I came up with; it runs 2-3 times faster in a quick test compiled by clang with full optimization:

#include <complex>
#include <immintrin.h>

void computeAbsolute2 (const std::complex<float>* __restrict cplxIn, float* __restrict absOut, const int length)
{    
    for (int i = 0; i < length; i += 8)
    {
        // load 8 complex values (--> 16 floats overall) into two SIMD registers
        __m256 inLo = _mm256_loadu_ps (reinterpret_cast<const float*> (cplxIn + i    ));
        __m256 inHi = _mm256_loadu_ps (reinterpret_cast<const float*> (cplxIn + i + 4));

        // separate the real and imaginary parts; the resulting values are in the wrong order
        __m256 re = _mm256_shuffle_ps (inLo, inHi, _MM_SHUFFLE (2, 0, 2, 0));
        __m256 im = _mm256_shuffle_ps (inLo, inHi, _MM_SHUFFLE (3, 1, 3, 1));

        // do the heavy work on the unordered vectors
        __m256 abs = _mm256_sqrt_ps (_mm256_add_ps (_mm256_mul_ps (re, re), _mm256_mul_ps (im, im)));

        // reorder values prior to storing
        __m256d ordered = _mm256_permute4x64_pd (_mm256_castps_pd(abs), _MM_SHUFFLE(3,1,2,0));
        _mm256_storeu_ps (absOut + i, _mm256_castpd_ps(ordered));
    }
}

I think I'll go with that implementation if no one comes up with a faster solution.

This compiles efficiently with gcc and clang (on the Godbolt compiler explorer).
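
Both loops above assume that `length` is a multiple of 8. If that isn't guaranteed, one option (a minimal sketch of mine, not part of the tested code) is to change the main loop condition to `i + 8 <= length` and handle the remainder with a scalar tail:

#include <complex>

// Hypothetical scalar tail: processes the last (length % 8) elements
void computeAbsoluteTail (const std::complex<float>* cplxIn, float* absOut, const int length)
{
    for (int i = length & ~7; i < length; ++i)
        absOut[i] = std::abs (cplxIn[i]);
}

As a side note, on targets with FMA3 (in practice every AVX2 CPU) the separate `_mm256_mul_ps` / `_mm256_add_ps` pair in the magnitude computation could be contracted into a single `_mm256_fmadd_ps (re, re, _mm256_mul_ps (im, im))`, as the compiler output in Dan M.'s answer does.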

PluginPenguin
  • You should be able to use `vpermpd ymm,ymm, imm8` for an immediate 64-bit shuffle, instead of needing a shuffle-control in a vector. You might need some cast intrinsics to make that happen (to `__m256d` and back). It probably won't make a speed difference on Intel CPUs, but on Ryzen `vpermps` has half the throughput of `vpermpd`. (https://agner.org/optimize/). So it's not worse there, and saves a vector register + loading a vector constant. – Peter Cordes Dec 03 '18 at 12:25
  • Could you elaborate on how a shuffle of the kind you propose would be implemented? – PluginPenguin Dec 03 '18 at 12:48
  • You're using `_mm256_permutevar8x32_ps` to reorder pairs of 32-bit elements. You could do exactly the same thing with a shuffle that operates on 64-bit elements. [`_mm256_permute4x64_pd( _mm256_castps_pd(abs), _MM_SHUFFLE(3,1,2,0))`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,AVX_512,SVML,Other&expand=2403,4147,4147&text=4x64_pd), with a cast back to `__m256` around that. So it will simply compile to a `vpermpd` immediate instead of `vpermps`. – Peter Cordes Dec 03 '18 at 13:30
  • Alright, got it. I tried it with `_mm256_permute_pd`, which didn't work, but this is indeed a nice solution – PluginPenguin Dec 03 '18 at 13:36
  • Intel's intrinsic search page can match on asm mnemonic. So you can type in `vpermpd` in https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,AVX_512,SVML,Other&text=ex_pd&expand=2403 and find it. (And yeah, Intel's intrinsics have annoyingly confusing names, thanks to the order they evolved in. Plain `permute` is the AVX1 in-lane permute, `vpermilpd`. The asm mnemonic is forward-looking enough to already describe its in-lane behaviour. That's one reason I mostly remember the asm mnemonics and look up the intrinsics when needed.) – Peter Cordes Dec 03 '18 at 13:46
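
For reference, this is roughly what the `vpermps` variant discussed in these comments looks like (a hypothetical reconstruction, not the code in the answer above; the index order matches the c0 c1 c4 c5 | c2 c3 c6 c7 layout produced by the two `_mm256_shuffle_ps` calls):

#include <immintrin.h>

// Hypothetical equivalent of the vpermpd reordering in computeAbsolute2, using
// vpermps instead: same result, but the shuffle control has to live in a vector register.
static inline __m256 reorderPairs (__m256 unordered)
{
    const __m256i order = _mm256_setr_epi32 (0, 1, 4, 5, 2, 3, 6, 7);
    return _mm256_permutevar8x32_ps (unordered, order);
}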

It's really hard (if at all possible) to write a "highly optimized AVX2" version of complex abs, since the way complex numbers are defined in the standard prevents a lot of optimization (specifically due to all the inf/NaN corner cases).
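
To illustrate one of those corner cases, here is a small example of mine (behaviour shown for typical implementations, where `std::abs` on a complex goes through a hypot-style algorithm): the naive `sqrt(re*re + im*im)` overflows for large components even though the true magnitude is perfectly representable:

#include <complex>
#include <cmath>
#include <cstdio>

int main()
{
    std::complex<float> z (3e20f, 4e20f);                                 // |z| = 5e20, well within float range
    float naive = std::sqrt (z.real() * z.real() + z.imag() * z.imag()); // the squares overflow float -> inf
    float exact = std::abs (z);                                          // typically 5e20, no intermediate overflow
    std::printf ("naive = %g, std::abs = %g\n", naive, exact);
}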

However, if you don't care about strict correctness, you can just use `-ffast-math` and some compilers will optimize the code for you. See the gcc output: https://godbolt.org/z/QbZlBI

You can also take this output and create your own abs function with inline assembly. But yes, as was already mentioned, if you really need performance, you probably want to swap std::complex for something else.

I was able to get decent output for your specific case, with all the required shuffles, by manually filling small re and im arrays. See https://godbolt.org/z/sWAAXo; this could be trivially extended to ymm registers.

Anyway, here is the ultimate solution, adapted from this SO answer, which uses intrinsics in combination with clever compiler optimizations:

#include <complex>
#include <cassert>
#include <immintrin.h>

static inline void cabs_soa4(const float *re, const float *im, float *b) {
    __m128 x4 = _mm_loadu_ps(re);
    __m128 y4 = _mm_loadu_ps(im);
    __m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
    _mm_storeu_ps(b, b4);
}

void computeAbsolute (const std::complex<float>* src,
                 float* realValuedDestinationVec,
                 int vecLength)
{
    for (int i = 0; i < vecLength; i += 4) {
        float re[4] = {src[i].real(), src[i + 1].real(), src[i + 2].real(), src[i + 3].real()};
        float im[4] = {src[i].imag(), src[i + 1].imag(), src[i + 2].imag(), src[i + 3].imag()};
        cabs_soa4(re, im, realValuedDestinationVec + i);
    }
}

which compiles to simple assembly:

_Z15computeAbsolutePKSt7complexIfEPfi:
        test    edx, edx
        jle     .L5
        lea     eax, [rdx-1]
        shr     eax, 2
        sal     rax, 5
        lea     rax, [rdi+32+rax]
.L3:
        vmovups xmm0, XMMWORD PTR [rdi]
        vmovups xmm2, XMMWORD PTR [rdi+16]
        add     rdi, 32
        vshufps xmm1, xmm0, xmm2, 136
        vmulps  xmm1, xmm1, xmm1
        vshufps xmm0, xmm0, xmm2, 221
        vfmadd132ps     xmm0, xmm1, xmm0
        vsqrtps xmm0, xmm0
        vmovups XMMWORD PTR [rsi], xmm0
        cmp     rax, rdi
        jne     .L3
.L5:
        ret

https://godbolt.org/z/Yu64Wg

Dan M.
  • gcc inlines it to mul / fma + `vsqrtps`, but doesn't auto-vectorize. clang doesn't inline it even with `-ffast-math`. https://godbolt.org/z/TBPYd2. Even adding `__restrict` didn't help auto-vectorize. I guess gcc and clang aren't up to the task of doing the shuffling. – Peter Cordes Dec 03 '18 at 10:07
  • @PeterCordes true. I was able to get vectorized output out of clang after changing the layout: https://godbolt.org/z/naseXk – Dan M. Dec 03 '18 at 10:14
  • Yeah, that's unsurprising because it's very easy to auto-vectorize those simple vertical operations that can map 1:1 into asm, but unfortunately the OP is only interested in an answer for the interleaved layout. – Peter Cordes Dec 03 '18 at 10:16
  • @PeterCordes well, in that case there is not much he can do, besides adding some gathers/shuffles and benchmarking everything to make sure there is a benefit in vectorizing at all. Perhaps even converting the given `std::complex` data to a different layout first, and then computing abs on it, would be more efficient. – Dan M. Dec 03 '18 at 10:21
  • That's exactly what this question is asking for: a manually vectorized loop. `sqrt` is expensive enough that I'm sure there will be benefit in vectorizing with a shuffle. `vsqrtps xmm` has the same performance as `sqrtss` on Intel and AMD, but does 4x as many SQRT operations. Probably with a `vpermps` + `vextracti128` to get two 128-bit vectors of real / imag, or 2x `vpermps` + 2x `vperm2f128` to get two 256-bit vectors. – Peter Cordes Dec 03 '18 at 10:27
  • Oh derp, right, 2x `vshufps` can create two vectors with the even / odd elements. Weird that gcc forgets about FMA for the Newton-Raphson iterations. I think I'd noticed that before. Anyway, you should copy the working code into your answer, not *just* a Godbolt link. (Especially not just a shortlink that can potentially rot. Prefer using full links: hit control-L to insert them as hyperlinks with other text as the link text.) – Peter Cordes Dec 03 '18 at 10:33
  • It remains remarkable that Intel hasn't bothered to provide efficient operations for this. This is not an unusual layout; FORTRAN and C also use this array-of-pairs representation. – MSalters Dec 03 '18 at 10:43
  • @MSalters well, they have a lot of instructions that can operate on "interleaved" data (i.e. where the data is combined from two regs), which covers this case reasonably well. In the end, I was able to get rather compact assembly. See the listing in my answer. – Dan M. Dec 03 '18 at 10:56