
Matt Scarpino gives a good explanation of how to multiply two complex doubles with Intel's AVX intrinsics (although he admits he's not sure it's the optimal algorithm, I offer him my gratitude). Here's his method, which I've verified:

__m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0);
__m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0);
__m256d neg  = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);

/* Step 1: Multiply vec1 and vec2 */
__m256d vec3 = _mm256_mul_pd(vec1, vec2);

/* Step 2: Switch the real and imaginary elements of vec2 */
vec2 = _mm256_permute_pd(vec2, 0x5);

/* Step 3: Negate the imaginary elements of vec2 */
vec2 = _mm256_mul_pd(vec2, neg);  

/* Step 4: Multiply vec1 and the modified vec2 */
__m256d vec4 = _mm256_mul_pd(vec1, vec2);

/* Horizontally subtract the elements in vec3 and vec4 */
vec1 = _mm256_hsub_pd(vec3, vec4);

/* Display the elements of the result vector */
double* res = (double*)&vec1;
printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);

My problem is that I want to square two complex doubles. I tried to use Matt's technique like so:

struct cmplx a;
struct cmplx b;

a.r = 2.5341;
a.i = 1.843;

b.r = 1.3941;
b.i = 0.93;

__m256d zzs = squareZ(a, b);

double* res = (double*) &zzs;

printf("\nA: %f + %f,  B: %f + %f\n", res[0], res[1], res[2], res[3]);

Using Haskell's complex arithmetic, I have verified the results are correct except, as you can see, the real part of B:

A: 3.025014 + 9.340693,  B: 0.000000 + 2.593026

So I have two questions really: is there a better (simpler and/or faster) way to square two complex doubles with AVX intrinsics? If not, how can I modify Matt's code to do it?

sacheie
  • What is `squareZ`? – mtrw Sep 15 '16 at 11:22
  • @mtrw squareZ is just the function name I gave Matt's code - I didn't bother to include function names in my snippets, my mistake. – sacheie Sep 15 '16 at 11:25
  • Can you please use the `edit` button and update the code? As things stand, nobody can cut and paste your code and run it to reproduce the problem. See http://stackoverflow.com/help/mcve. – mtrw Sep 15 '16 at 12:01
  • Negating the odd element (holding the imaginary part) would be much more efficiently done with `_mm256_xor_pd` to flip the sign bits. That's one of the main use-cases for the FP versions of vector XOR/AND/OR. Also note that keeping the real and imaginary parts in two separate vectors would allow much more efficient operation: 1. no shuffle to swap real/imag: that happens for free by operating on different variables. 2. fast vertical sub instead of slow hsub. 3. You can flip the sign of 4 imaginary parts per insn. Using `shuffle_pd` to do this for packed input/output might be a win. – Peter Cordes Sep 15 '16 at 15:27
  • @mtrw - Apologies, but I'm at work and haven't got my exact version of the code (mine takes arguments whereas Matt's was just an example with literal values). I will update it when I get home. – sacheie Sep 15 '16 at 20:58
  • @PeterCordes Yeah, I've seen way too many people stick to the old alternating real/imag representation, and it makes me mad. In a lot of cases, the crappy representation is dictated at the API level or for backwards compatibility. And then people complain that the compiler can't vectorize an array of `std::complex` as well as someone like you and I can... – Mysticial Sep 15 '16 at 22:11
  • @Mysticial: Surprisingly, gcc totally shoots itself in the foot with 8 extra VPERMPD instructions when auto-vectorizing, because it wants to keep its temporaries in ABCD order, not AC BD lane-crossing order. Other than that, gcc output made a good starting point, and does slightly better than Matt's code (copied by the OP) when you have 2 pairs of vectors to multiply. (Or significantly better with FMA). – Peter Cordes Sep 15 '16 at 22:31
  • @PeterCordes I was also playing around on godbolt to see what GCC does with `std::complex`. First of all, not having `ffast-math` is asking for trouble (function call). With that and `-mavx2`, it generates 128-bit load/stores and a crazy amount of insert/extract instructions. If you do `mtune=haswell` and `restrict`, then it finally does something sane-ish. IOW compilers have a long way to go. And it's a hard problem. Of course none of that solves the data-layout inefficiency of `std::complex`. – Mysticial Sep 15 '16 at 22:40
  • @Mysticial: Yeah, I noticed that, too. I was surprised it couldn't inline a complex mul without -ffast-math. It's no surprise at all that it needs `restrict`, of course. That's just par for the course, and would be expected even for separate real/imag arrays. I did notice that gcc 6.2 (but not 7) checks for overlap if you use `*restrict dst` but omit restrict from the source arrays. Hopefully gcc's alias analysis would do better with more context (like whole-program LTO, or when inlining into something, but maybe only if it operated on static data :/) – Peter Cordes Sep 15 '16 at 22:45
  • @PeterCordes I believe `std::complex` follows some IEEE specification for the usual NaN corner cases - which expands out to [this monstrosity](http://opensource.apple.com/source/clang/clang-137/src/projects/compiler-rt/lib/muldc3.c). That's not even funny. – Mysticial Sep 15 '16 at 22:53

2 Answers


This answer covers the general case of multiplying two arrays of complex numbers

Ideally, store your data in separate real and imaginary arrays, so you can just load contiguous vectors of real and imaginary parts. That makes it free to do the cross-multiplying (just use different registers / variables) instead of having to shuffle things around within a vector.
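
For illustration, here's a minimal sketch of that layout (hedged: the function and parameter names are hypothetical, and it assumes the length is a multiple of 4 and that FMA is available):

#include <immintrin.h>

/* Minimal sketch: complex multiply with split real/imag arrays ("SoA" layout).
   Assumes n is a multiple of 4 and the CPU supports AVX + FMA. */
void cmul_split(double *restrict dst_re, double *restrict dst_im,
                const double *restrict a_re, const double *restrict a_im,
                const double *restrict b_re, const double *restrict b_im, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d ar = _mm256_loadu_pd(a_re + i), ai = _mm256_loadu_pd(a_im + i);
        __m256d br = _mm256_loadu_pd(b_re + i), bi = _mm256_loadu_pd(b_im + i);
        // real = ar*br - ai*bi,  imag = ar*bi + ai*br: pure vertical math, no shuffles
        _mm256_storeu_pd(dst_re + i, _mm256_fmsub_pd(ar, br, _mm256_mul_pd(ai, bi)));
        _mm256_storeu_pd(dst_im + i, _mm256_fmadd_pd(ar, bi, _mm256_mul_pd(ai, br)));
    }
}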

You can convert between interleaved double complex style and SIMD-friendly separate-vectors style on the fly fairly cheaply, subject to the vagaries of AVX in-lane shuffles. e.g. very cheaply with unpacklo / unpackhi shuffles to de-interleave or to re-interleave within a lane, if you don't care about the actual order of the data within the temporary vector.

It's actually so cheap to do this shuffle that doing it on the fly for a single complex multiply comes out somewhat ahead of (even a tweaked version of) Matt's code, especially on CPUs that support FMA. This requires producing results in groups of 4 complex doubles (2 result vectors).

If you need to produce only one result vector at a time, I also came up with an alternative to Matt's algorithm that can use FMA (actually FMADDSUB) and avoid the separate sign-change insn.


gcc auto-vectorizes a simple complex-multiply scalar loop to pretty good code, as long as you use -ffast-math. It deinterleaves like I suggested.

#include <complex.h>

// even with -ffast-math -ffp-contract=fast, clang doesn't manage to use vfmaddsubpd, instead using vmulpd and vaddsubpd :(
// gcc does use FMA though.

// auto-vectorizes with a lot of extra shuffles
void cmul(double complex *restrict dst,
          const double complex *restrict A, const double complex *restrict B)
{       // clang and gcc change strategy slightly for i<1 or i<2, vs. i<4
  for (int i=0; i<4 ; i++) {
    dst[i] = A[i] * B[i];
  }
}

See the asm on the Godbolt compiler explorer. I'm not sure how good clang's asm is; it uses a lot of 64b->128b VMOVDDUP broadcast-loads. This form is handled purely in the load ports on Intel CPUs (see Agner Fog's insn tables), but it's still a lot of operations. As mentioned earlier, gcc uses 4 VPERMPD shuffles to reorder across lanes before multiplying / FMA, then another 4 VPERMPD to reorder the results before combining them with VSHUFPD. This is 8 extra shuffles for 4 complex multiplies.


Converting gcc's version back to intrinsics and removing the redundant shuffles gives optimal code. (gcc apparently wants its temporaries to be in A B C D order instead of the A C B D order resulting from the in-lane behaviour of VUNPCKLPD (_mm256_unpacklo_pd)).

I put the code on Godbolt, along with a tweaked version of Matt's code. So you can play around with different compiler options, and also different compiler versions.

// multiplies 4 complex doubles each from A and B, storing the result in dst[0..3]
void cmul_manualvec(double complex *restrict dst,
          const double complex *restrict A, const double complex *restrict B)
{
                                         // low element first, little-endian style
  __m256d A0 = _mm256_loadu_pd((double*)A);    // [A0r A0i  A1r A1i ] // [a b c d ]
  __m256d A2 = _mm256_loadu_pd((double*)(A+2));                       // [e f g h ]
  __m256d realA = _mm256_unpacklo_pd(A0, A2);  // [A0r A2r  A1r A3r ] // [a e c g ]
  __m256d imagA = _mm256_unpackhi_pd(A0, A2);  // [A0i A2i  A1i A3i ] // [b f d h ]
  // the in-lane behaviour of this interleaving is matched by the same in-lane behaviour when we recombine.
  
  __m256d B0 = _mm256_loadu_pd((double*)B);                           // [m n o p]
  __m256d B2 = _mm256_loadu_pd((double*)(B+2));                       // [q r s t]
  __m256d realB = _mm256_unpacklo_pd(B0, B2);                         // [m q o s]
  __m256d imagB = _mm256_unpackhi_pd(B0, B2);                         // [n r p t]

  // desired:  real=rArB - iAiB,  imag=rAiB + rBiA
  __m256d realprod = _mm256_mul_pd(realA, realB);
  __m256d imagprod = _mm256_mul_pd(imagA, imagB);
  
  __m256d rAiB     = _mm256_mul_pd(realA, imagB);
  __m256d rBiA     = _mm256_mul_pd(realB, imagA);

  // gcc and clang will contract these into FMA.  (clang needs -ffp-contract=fast)
  // Doing it manually would remove the option to compile for non-FMA targets
  __m256d real     = _mm256_sub_pd(realprod, imagprod);  // [D0r D2r | D1r D3r]
  __m256d imag     = _mm256_add_pd(rAiB, rBiA);          // [D0i D2i | D1i D3i]

  // interleave the separate real and imaginary vectors back into packed format
  __m256d dst0 = _mm256_shuffle_pd(real, imag, 0b0000);  // [D0r D0i | D1r D1i]
  __m256d dst2 = _mm256_shuffle_pd(real, imag, 0b1111);  // [D2r D2i | D3r D3i]
  _mm256_storeu_pd((double*)dst, dst0);
  _mm256_storeu_pd((double*)(dst+2), dst2);   
}

  Godbolt asm output: gcc6.2 -O3 -ffast-math -ffp-contract=fast -march=haswell
    vmovupd         ymm0, YMMWORD PTR [rsi+32]
    vmovupd         ymm3, YMMWORD PTR [rsi]
    vmovupd         ymm1, YMMWORD PTR [rdx]
    vunpcklpd       ymm5, ymm3, ymm0
    vunpckhpd       ymm3, ymm3, ymm0
    vmovupd         ymm0, YMMWORD PTR [rdx+32]
    vunpcklpd       ymm4, ymm1, ymm0
    vunpckhpd       ymm1, ymm1, ymm0
    vmulpd          ymm2, ymm1, ymm3
    vmulpd          ymm0, ymm4, ymm3
    vfmsub231pd     ymm2, ymm4, ymm5     # separate mul/sub contracted into FMA
    vfmadd231pd     ymm0, ymm1, ymm5
    vunpcklpd       ymm1, ymm2, ymm0
    vunpckhpd       ymm0, ymm2, ymm0
    vmovupd         YMMWORD PTR [rdi], ymm1
    vmovupd         YMMWORD PTR [rdi+32], ymm0
    vzeroupper
    ret

For 4 complex multiplies (of 2 pairs of input vectors), my code uses:

  • 4 loads (32B each)
  • 2 stores (32B each)
  • 6 in-lane shuffles (one for each input vector, one for each output)
  • 2 VMULPD
  • 2 VFMA...something
  • (only 4 shuffles if we can use the results in separated real and imag vectors, or 0 shuffles if the inputs are already in this format, too)
  • latency on Intel Skylake (not counting loads/stores): 14 cycles = 4c for 4 shuffles until the second VMULPD can start + 4 cycles (second VMULPD) + 4c (second vfmadd231pd) + 1c (shuffle first result vector ready 1c earlier) + 1c (shuffle second result vector)

So for throughput, this completely bottlenecks on the shuffle port. (1 shuffle per clock throughput, vs. 2 total MUL/FMA/ADD per clock on Intel Haswell and later). This is why packed storage is horrible: shuffles have limited throughput, and spending more instructions shuffling than on doing math is not good.


Matt Scarpino's code with my minor tweaks (repeated to do 4 complex multiplies). (See below for my rewrite that produces one vector at a time more efficiently.)

  • the same 6 loads/stores
  • 6 in-lane shuffles (HSUBPD is 2 shuffles and a subtract on current Intel and AMD CPUs)
  • 4 multiplies
  • 2 subtracts (which can't combine with the muls into FMAs)
  • An extra instruction (+ a constant) to flip the sign of the imaginary elements. Matt chose to multiply by 1.0 or -1.0, but the efficient choice is to XOR the sign bit (i.e. XORPD with -0.0); see the sketch after this list.
  • latency on Intel Skylake for the first result vector: 11 cycles. 1c (vpermilpd and vxorpd in the same cycle) + 4c (second vmulpd) + 6c (vhsubpd). The first vmulpd overlaps with other ops, starting in the same cycle as the shuffle and vxorpd. Computation of a second result vector should interleave pretty nicely.
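
For reference, a hedged, untested sketch of the single-vector building block of that tweaked version (Matt's sequence with the multiply by +/-1.0 replaced by an XOR of the sign bits; the function name is just illustrative; for 4 multiplies it would simply run twice):

#include <immintrin.h>

/* Hedged sketch: Matt's sequence, but flipping sign bits with XOR instead of
   multiplying by +/-1.0.  vec1 = [a b | c d] meaning a+b*i and c+d*i, likewise vec2. */
static inline __m256d cmul_matt_xor(__m256d vec1, __m256d vec2)
{
    const __m256d sign_odd = _mm256_setr_pd(0.0, -0.0, 0.0, -0.0);

    __m256d vec3 = _mm256_mul_pd(vec1, vec2);      // [a*c  b*d | ...]
    vec2 = _mm256_permute_pd(vec2, 0x5);           // swap real/imag within each lane
    vec2 = _mm256_xor_pd(vec2, sign_odd);          // flip the sign of the odd elements
    __m256d vec4 = _mm256_mul_pd(vec1, vec2);      // [a*d  -b*c | ...]
    return _mm256_hsub_pd(vec3, vec4);             // [ac-bd  ad+bc | ...]
}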

The major advantage of Matt's code is that it works with just one vector-width of complex multiplies at once, instead of requiring you to have 4 input vectors of data. It has somewhat lower latency. But note that my version doesn't need the 2 pairs of input vectors to be from contiguous memory, or related to each other at all. They get mixed together while processing, but the result is 2 separate 32B vectors.

My tweaked version of Matt's code is nearly as good (as the 4-at-a-time version) on CPUs without FMA (just costing an extra VXORPD), but significantly worse when it stops us from taking advantage of FMA. Also, it never has the results available in non-packed form, so you can't use the separated form as input to another multiply and skip the shuffling.


One vector result at a time, with FMA:

Don't use this if you're sometimes squaring instead of multiplying two different complex numbers: like Matt's algorithm, it doesn't simplify via common-subexpression elimination when both inputs are the same.

I haven't typed in the C intrinsics for this, just worked out the data movement. Since all the shuffles are in-lane, I'll only show the low lane. Use the 256b versions of the relevant instructions to do the same shuffle in both lanes. They stay separate.

// MULTIPLY: for each AVX lane: am-bn, an+bm  
 r i  r i
 a b  c d       // input vectors: a + b*i, etc.
 m n  o p

Algorithm:

  • create bm bn with movshdup(a b) + mulpd

  • create bn bm with shufpd on the previous result. (or create n m with a shuffle before the mul)

  • create a a with movsldup(a b)

  • use fmaddsubpd to produce the final result: [a|a]*[m|n] -/+ [bn|bm].

    Yes, SSE/AVX has ADDSUBPD to do alternating subtract/add in even/odd elements (in that order, presumably because of this use-case). FMA includes FMADDSUB132PD, which subtracts and adds (and the reverse, FMSUBADD, which adds and subtracts).
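
A rough, untested sketch of that sequence in intrinsics (note that movshdup/movsldup are the single-precision mnemonics; for packed doubles the equivalent in-lane duplications are _mm256_permute_pd / unpckhpd and _mm256_movedup_pd; the function name is illustrative):

#include <immintrin.h>

/* Hedged sketch (untested), requires AVX + FMA.
   A = [a b | c d]  (a+b*i, c+d*i),  B = [m n | o p]  (m+n*i, o+p*i). */
static inline __m256d cmul_1vec_fma(__m256d A, __m256d B)
{
    __m256d b_b   = _mm256_permute_pd(A, 0xF);      // [b b | d d]   duplicate imag parts
    __m256d bm_bn = _mm256_mul_pd(b_b, B);          // [bm bn | do dp]
    __m256d bn_bm = _mm256_permute_pd(bm_bn, 0x5);  // [bn bm | dp do]  (the shufpd step)
    __m256d a_a   = _mm256_movedup_pd(A);           // [a a | c c]   duplicate real parts
    return _mm256_fmaddsub_pd(a_a, B, bn_bm);       // [am-bn  an+bm | co-dp  cp+do]
}

That's 3 shuffles, 1 multiply and 1 FMADDSUB per vector of 2 complex results, matching the per-4-results counts below.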

Per 4 results: 6x shuffle, 2x mul, 2xfmaddsub. So unless I got something wrong, it's as efficient as the deinterleave method (when not squaring the same number). Skylake latency = 10c = 1+4+1 to create bn bm (overlapping with 1 cycle to create a a), + 4 (FMA). So it's one cycle lower latency than Matt's.

On Bulldozer-family, it would be a win to shuffle both inputs to the first mul, so the mul->fmaddsub critical path stays inside the FMA domain (1 cycle lower latency). Doing it the other way helps stop silly compilers from making resource conflicts by doing the movsldup(a b) too early, and delaying the mulpd. (In a loop, though, many iterations will be in flight and bottleneck on the shuffle port.)


This is still better than Matt's for squaring (still save the XOR, and can use FMA), but we don't save any shuffles:

// SQUARING: for each AVX lane: aa-bb, 2*ab
// ab bb  // movshdup + mul
// bb ab  // ^ -> shufpd

// a  a          // movsldup
// aa-bb  ab+ab  // fmaddsubpd : [a|a]*[a|b] -/+ [bb|ab]
// per 4 results: 6x shuffle, 2x mul, 2xfmaddsub
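
In the same hedged, untested style, the squaring variant might look like this (illustrative name):

#include <immintrin.h>

/* Hedged sketch (untested), requires AVX + FMA.  A = [a b | c d]  (a+b*i, c+d*i). */
static inline __m256d csquare_1vec_fma(__m256d A)
{
    __m256d b_b   = _mm256_permute_pd(A, 0xF);      // [b b | d d]
    __m256d ab_bb = _mm256_mul_pd(b_b, A);          // [ab bb | cd dd]
    __m256d bb_ab = _mm256_permute_pd(ab_bb, 0x5);  // [bb ab | dd cd]
    __m256d a_a   = _mm256_movedup_pd(A);           // [a a | c c]
    return _mm256_fmaddsub_pd(a_a, A, bb_ab);       // [aa-bb  2ab | cc-dd  2cd]
}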

I also played around with some possibilities like (a+b) * (a+b) = aa+2ab+bb, or (r-i)*(r+i) = rr - ii but didn't get anywhere. Rounding between steps means that FP math doesn't cancel perfectly, so doing something like this wouldn't even produce exactly identical results.

Peter Cordes
  • Wow, this is an incredibly detailed answer; I wish I could upvote it more than once. It did occur to me to store the real and imaginary parts in separate arrays for better efficiency. But what I feel may be lost (if only because so much of this information is over my head) is the fact that I was hoping the problem could be simplified because the two complex numbers merely need to be squared. I don't need the product of two arbitrary complex doubles; I just need to square them both as efficiently as possible. If you've explained that and I missed it, can you please point out where? – sacheie Sep 16 '16 at 00:25
  • @sacheie: oh hmm, no I forgot that was the question >.< That does maybe save some math, since the cross product is the same both ways. I'll probably post a separate answer for the squaring special case, since this one is long already. – Peter Cordes Sep 16 '16 at 00:41
  • @sacheie When you read one of Peter's answers you're *supposed* to be overwhelmed. There are no exceptions - not even myself. – Mysticial Sep 16 '16 at 15:56
  • @Mysticial: If I didn't write down every relevant thought I had while answering, that would feel like a waste :P Static perf analysis like this is one of the few useful ways to compare different ways to implement a short SIMD function, so other than benchmarking on CPUs I don't have, it's the only way to answer questions like "is there a better way". I'm glad people find it useful :) – Peter Cordes Sep 16 '16 at 16:03
  • You're one of the few people I know who are probably capable of writing optimal code for processors that don't even exist yet. Have you ever tried that? I've been hit-and-miss on that. I nailed it for Sandy Bridge with code I wrote back in 2010. Failed miserably for AMD Piledriver due to the 256-bit store bug. I never tried Haswell since I was preoccupied with other obligations. I didn't try Skylake since the architecture was too similar to Haswell. I have code for KNL, SKX, and CNL. I have since updated the KNL code based on recently published architecture details. – Mysticial Sep 16 '16 at 16:17
  • And I don't have anything for Zen yet, since virtually no details came out until very recently. – Mysticial Sep 16 '16 at 16:17
  • @Mysticial: I haven't tried to tune for anything that Agner Fog hadn't written up yet. I don't usually have my hands on the latest hardware, though, so that's similar. I wrote [this AVX2+BMI2 left-pack with shuffle-mask generated on the fly with PEXT](http://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask) with zero testing other than making sure it compiled to asm that looked good. The OP reported that it worked correctly :). I assume it's near-optimal on Haswell, but I haven't tried for real. – Peter Cordes Sep 16 '16 at 16:26
  • Ah. Yeah, my programmer side combined with my overclocker side is too much. So I end up getting a piece of every generation. – Mysticial Sep 16 '16 at 16:30

See my other answer for the general case of multiplying different complex numbers, not squaring.

TL:DR: just use the code in my other answer with both inputs the same. Compilers do a good job with the redundancy.


Squaring simplifies the math slightly: instead of needing two different cross products, rAiB and rBiA are the same. But it still needs to get doubled, so basically we end up with 2 mul + 1 FMA + 1 add, instead of 2 mul + 2 FMA.

With the SIMD-unfriendly interleaved storage format, it gives a big boost to the deinterleave method, since there's only one input to shuffle. Matt's method doesn't benefit at all, since it calculates both cross products with the same vector multiply.


Using the cmul_manualvec() from my other answer:

// squares 4 complex doubles from A[0..3], storing the result in dst[0..3]
void csquare_manual(double complex *restrict dst,
          const double complex *restrict A) {
  cmul_manualvec(dst, A, A);
}

gcc and clang are smart enough to optimize away the redundancy of using the same input twice, so there's no need to make a custom version with intrinsics. clang does a bad job on the scalar auto-vectorizing version, so don't use that. I don't see anything to be gained over this asm output (from Godbolt):

        clang3.9 -O3 -ffast-math -ffp-contract=fast -march=haswell
    vmovupd         ymm0, ymmword ptr [rsi]
    vmovupd         ymm1, ymmword ptr [rsi + 32]
    vunpcklpd       ymm2, ymm0, ymm1
    vunpckhpd       ymm0, ymm0, ymm1   # doing this shuffle first would let the first multiply start a cycle earlier.  Silly compiler.
    vmulpd          ymm1, ymm0, ymm0   # imag*imag
    vfmsub231pd     ymm1, ymm2, ymm2   # real*real - imag*imag
    vaddpd          ymm0, ymm0, ymm0   # imag+imag = 2*imag
    vmulpd          ymm0, ymm2, ymm0   # 2*imag * real
    vunpcklpd       ymm2, ymm1, ymm0
    vunpckhpd       ymm0, ymm1, ymm0
    vmovupd ymmword ptr [rdi], ymm2
    vmovupd ymmword ptr [rdi + 32], ymm0
    vzeroupper
    ret

Possibly a different instruction ordering would have been better, to maybe reduce resource conflicts. e.g. double the real vector, since it's unpacked first, so the VADDPD could start a cycle sooner, before the imag*imag VMULPD. But reordering lines in the C source doesn't usually translate directly to asm reordering, because modern compilers are complex beasts. (IIRC, gcc doesn't particularly try to schedule instructions for x86, because out-of-order execution mostly hides those effects.)

Anyway, per 4 complex squares:

  • 2 loads (down from 4) + 2 stores, for obvious reasons
  • 4 shuffles (down from 6), again obvious
  • 2 VMULPD (same)
  • 1 FMA + 1 VADDPD (down from 2 FMA. VADDPD is lower latency than FMA on Haswell/Broadwell, same on Skylake).

Matt's version would still be 6 shuffles, and same everything else.

Peter Cordes
  • I've copied your code exactly (except changing the names to cproduct and csquare), and the result is perfect but I get a segfault immediately after. – sacheie Sep 16 '16 at 06:27
  • Here are my Makefile flags: COMPILER_FLAGS = -std=c11 -O3 -Wall -Wextra -march=skylake -mtune=skylake -mavx2 – sacheie Sep 16 '16 at 06:28
  • With compiler set to gcc. And here is my code: int main() { double complex dst; double complex a = 2.5341 + 1.843 * I; csquare(&dst, &a); printf("%f + %f\n", creal(dst), cimag(dst)); return 0; } Am I misunderstanding something about the 'restrict' keyword? – sacheie Sep 16 '16 at 06:30
  • @sacheie. That's unsurprising. Notice the comment `// multiplies 4 complex doubles each from A and B, storing the result in dst[0..3]` at the top of cproduct, describing what memory it reads and writes. As far as restrict, if http://en.cppreference.com/w/c/language/restrict describes what you think it does, then no, otherwise yes you're misunderstanding it. IDK what you think it does! – Peter Cordes Sep 16 '16 at 06:31
  • Ahh, that was quite stupid of me. Thank you for your patience. I had forgotten to reference dst properly; it had nothing to do with 'restrict.' – sacheie Sep 16 '16 at 06:39
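
For reference, a corrected harness might look like this hedged sketch (it assumes the csquare_manual / cmul_manualvec definitions from the answer are in the same translation unit; the key point, per the comment above, is that both arguments must point to arrays of at least 4 complex doubles, since the function loads A[0..3] and stores dst[0..3]):

#include <complex.h>
#include <stdio.h>

int main(void) {
    double complex a[4] = { 2.5341 + 1.843*I, 1.3941 + 0.93*I, 0.0, 0.0 };
    double complex dst[4];
    csquare_manual(dst, a);                  // reads a[0..3], writes dst[0..3]
    printf("%f + %f*I\n", creal(dst[0]), cimag(dst[0]));
    printf("%f + %f*I\n", creal(dst[1]), cimag(dst[1]));
    return 0;
}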