One big hotspot in EM simulation within the xnec2c project takes this form, and the same form is repeated throughout the calculation in various ways:

*dst += (*a1) * (*a2) + (*b1) * (*b2) + (*c1) * (*c2);  // + (*d1) * (*d2)

(This is taken from a real example in matrix.c on the project's GitHub.)

Typically a1, b1, c1 are complex doubles. Sometimes a2, b2, c2 are complex, other times they are real doubles. (I commented out d1, d2 because they don't exist in the EM simulation, but I'm guessing we get them for free in AVX calculations, so the code below is written for 4 doubles.)

Below is a modified version of Peter Cordes's answer, adapted to:

  1. Multiply 4 complex numbers against 4 doubles as above
  2. Sum result.
  3. Add the sum from #2 to *dst

static inline void avx_fma_4zd4d_sum(double complex * restrict dst,
  const double complex * restrict A, 
  double *b0, double *b1, double *b2, double *b3)
{
    // low element first, little-endian style
    __m256d A0 = _mm256_loadu_pd((double *) A); // [A0r A0i A1r A1i ] // [a b c d ]
    __m256d A2 = _mm256_loadu_pd((double *) (A + 2));   // [e f g h ]

    // Q: b0-b3 are spread out in memory. Is this the best way to load them?
    __m256d B0 = _mm256_set_pd(*b1, *b1, *b0, *b0); // reversed from the A order because _mm256_set_pd takes its arguments high element first
    __m256d B2 = _mm256_set_pd(*b3, *b3, *b2, *b2); 

    // desired: real=rArB, imag=rBiA
    A0 = _mm256_mul_pd(A0, B0);
    A2 = _mm256_mul_pd(A2, B2);

    // sum the products for A0-A3 pairwise (result still interleaved: [re02 im02 re13 im13])
    A0 = _mm256_add_pd(A0, A2);

    // Q: Now A0 has 2x complex doubles in a single vector.  How do I sum them?

    // Q: Once I do, how do I add the result to dst?

    //_mm256_storeu_pd((double *) dst, A0);
    //_mm256_storeu_pd((double *) (dst + 2), A2);
}

You can see my questions in the comments above. I was looking at this answer, but it sums a vector of plain doubles; mine holds complex doubles laid out as [A0r, A0i, A1r, A1i], so the sums need to stay interleaved. The result would be the line below if I dropped out of intrinsics, but I'd like to understand the intrinsics for this operation:

dst += (A0[0]+A0[2]) + (A0[1]+A0[3]) * I;

Note that *dst is not in memory near *A so it will need to be a separate load.

I don't have FMA hardware, but I would like it to compile such that gcc/clang will reduce to FMA if available.
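
If it helps, here is my best guess so far at the missing tail of the function, based on splitting into __m128d halves (untested, and exactly the part I'm unsure about):

    // Untested guess at the tail of avx_fma_4zd4d_sum():
    // after the add above, A0 holds [ re02  im02  re13  im13 ] (still interleaved),
    // so adding its two 128-bit halves gives [ re_total  im_total ].
    __m128d lo  = _mm256_castpd256_pd128(A0);
    __m128d hi  = _mm256_extractf128_pd(A0, 1);
    __m128d sum = _mm_add_pd(lo, hi);

    // *dst is not near *A, so it needs its own load; accumulate and store back.
    __m128d d   = _mm_loadu_pd((double *) dst);
    _mm_storeu_pd((double *) dst, _mm_add_pd(d, sum));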

Related Questions:

  1. What if a2, b2, c2, d2 are complex? (named b0-b3 in the vector function)
    • Is summing into dst easier when they are all complex values?
  2. I can typically put a1, b1, c1 in the same vector, but a2, b2, c2 are all over memory. Is there a better way to load them than using _mm256_set_pd?

Thanks for your help!

KJ7LNW
  • If you stored your complex numbers as an array of reals and an array of imaginary, it would be much more SIMD-friendly. Then you could very easily do 4 operations in parallel, purely vertically, using contiguous `_mm256_loadu_pd` on the doubles to line up with the real and imag parts of corresponding complex numbers, with no shuffling. Although perhaps not such a disaster in this case, since it is possible to do a 128-bit broadcast-load plus `vshufpd`, and you're not doing multiplies between complex numbers, only complex * real? But then you end up needing a horizontal sum. – Peter Cordes Jan 27 '22 at 18:41
  • @PeterCordes, FYI: In this code the doubles (b0-b3) are real values multiplied by complex values, so they simply multiply against each real and imag component because b0 * (A0r + A0i*I) = b0*A0r + b0*A0i*I, for each of b0-b3 and A0-A3. If this were complex*complex then the math would be the same as your original code in the referenced post above. Correct me if I'm wrong, but I think that means no shuffle is necessary in this case. – KJ7LNW Jan 28 '22 at 01:16
  • How do you think a compiler can implement `_mm256_set_pd(*b1, *b1, *b0, *b0);` without any shuffle instructions? I guess broadcast-load and blend, but that's still extra instructions even if not technically shuffles. Also, you said yourself "Now A0 has 2x complex doubles in a single vector. How do I sum them?" - that obviously requires some kind of horizontal data movement, aka shuffling. (There is a `vhaddpd` instruction, but on real CPUs it decodes to shuffle uops to feed a normal vertical `vaddpd`-like uop.) – Peter Cordes Jan 28 '22 at 01:18
  • @PeterCordes, it occurs to me that `[A0r A0i A1r A1i]` are in the vector, so would it be a good idea to split and sum as `__m128d` and add like this: `__m128d sum = _mm_add_pd(_mm256_castpd256_pd128(A0), _mm256_extractf128_pd(A0, 1))` ? – KJ7LNW Jan 28 '22 at 01:23
  • Yes, that would be the normal way to do that add, costing one shuffle (`_mm256_extractf128_pd`) on top of an add uop, and only getting half the work done per add instruction vs. using 256-bit adds to work on 4 complex numbers in parallel. Unless you're on AMD Bulldozer-family or Zen1 (where SIMD execution units are only 128-bit wide anyway and 256-bit vertical ops decode to 2 uops), in which case the half-width add is cheaper and the extract is cheap. See https://agner.org/optimize/ and https://uops.info/ – Peter Cordes Jan 28 '22 at 01:28

1 Answer

Here are some pieces you're going to need. Untested.

// 4 FP64 complex numbers in 2 AVX SIMD vectors
struct Complex4 { __m256d vec1, vec2; };

// Change the macro to 1 if you have FMA and are optimizing for Intel
#define USE_FMA 0

Complex4 add( Complex4 a, Complex4 b )
{
    Complex4 res;
    res.vec1 = _mm256_add_pd( a.vec1, b.vec1 );
    res.vec2 = _mm256_add_pd( a.vec2, b.vec2 );
    return res;
}

// Multiply vectors component-wise.
// This is not a complex multiplication, just 2 vmulpd instructions
Complex4 mul_cwise( Complex4 a, Complex4 b )
{
    Complex4 res;
    // Compute the product
    res.vec1 = _mm256_mul_pd( a.vec1, b.vec1 );
    res.vec2 = _mm256_mul_pd( a.vec2, b.vec2 );
    return res;
}

// Multiply + accumulate vectors component-wise
// This is not a complex multiplication.
Complex4 fma_cwise( Complex4 a, Complex4 b, Complex4 acc )
{
#if USE_FMA
    Complex4 res;
    res.vec1 = _mm256_fmadd_pd( a.vec1, b.vec1, acc.vec1 );
    res.vec2 = _mm256_fmadd_pd( a.vec2, b.vec2, acc.vec2 );
    return res;
#else
    return add( mul_cwise( a, b ), acc );
#endif
}

// Load 4 scalars from memory, and duplicate them into 2 AVX vectors
// This is for the complex * real use case
Complex4 load_duplicating( double* rsi )
{
    Complex4 res;
#ifdef __AVX2__
    __m256d tmp = _mm256_loadu_pd( rsi );
    res.vec1 = _mm256_permute4x64_pd( tmp, _MM_SHUFFLE( 1, 1, 0, 0 ) );
    res.vec2 = _mm256_permute4x64_pd( tmp, _MM_SHUFFLE( 3, 3, 2, 2 ) );
#else
    res.vec1 = _mm256_setr_m128d( _mm_loaddup_pd( rsi ), _mm_loaddup_pd( rsi + 1 ) );
    res.vec2 = _mm256_setr_m128d( _mm_loaddup_pd( rsi + 2 ), _mm_loaddup_pd( rsi + 3 ) );
#endif
    return res;
}

// Load 4 complex numbers
Complex4 load_c4( double* rsi )
{
    Complex4 res;
    res.vec1 = _mm256_loadu_pd( rsi );
    res.vec2 = _mm256_loadu_pd( rsi + 4 );
    return res;
}

// Multiply 2 complex numbers by another 2 complex numbers
__m256d mul_CxC_2( __m256d a, __m256d b )
{
    // The formula is [ a.x*b.x - a.y*b.y,  a.x*b.y + a.y*b.x ]
    // Computing it as a.xy * b.xx -+ a.yx * b.yy

    // _mm256_permute_pd takes one selector bit per element (4 bits total),
    // so the 2-bit pattern must be repeated for both 128-bit lanes.
    __m256d ayx = _mm256_permute_pd( a, 0x5 ); // 0b0101: a.yx in both lanes
    __m256d byy = _mm256_permute_pd( b, 0xF ); // 0b1111: b.yy in both lanes
    __m256d p1 = _mm256_mul_pd( ayx, byy ); // a.yx * b.yy

    __m256d bxx = _mm256_permute_pd( b, 0x0 ); // 0b0000: b.xx in both lanes

#if USE_FMA
    return _mm256_fmaddsub_pd( a, bxx, p1 );
#else
    return _mm256_addsub_pd( _mm256_mul_pd( a, bxx ), p1 );
#endif
}
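
For the complex * real hotspot in the question, the pieces above might compose like this. This is an untested sketch: the function name is just for illustration, and since b0-b3 are scattered in memory it builds the duplicated-real vectors with `_mm256_setr_pd` instead of `load_duplicating` (which wants 4 contiguous doubles).

// Sketch, untested: *dst += A[0]*b0 + A[1]*b1 + A[2]*b2 + A[3]*b3
// with A holding 4 interleaved complex doubles and b0..b3 scattered real doubles.
void accum_4zd4d( double* dst, double* A,
    const double* b0, const double* b1, const double* b2, const double* b3 )
{
    Complex4 a = load_c4( A );

    // Duplicate each scattered real into both slots of its complex number
    Complex4 b;
    b.vec1 = _mm256_setr_pd( *b0, *b0, *b1, *b1 );
    b.vec2 = _mm256_setr_pd( *b2, *b2, *b3, *b3 );

    // Component-wise product = complex * real, still interleaved [ re im re im ]
    Complex4 p = mul_cwise( a, b );

    // Horizontal reduction: 4 partial products -> 2 -> 1 complex number
    __m256d s4 = _mm256_add_pd( p.vec1, p.vec2 );
    __m128d s2 = _mm_add_pd( _mm256_castpd256_pd128( s4 ),
                             _mm256_extractf128_pd( s4, 1 ) );

    // dst is not near A in memory, so it needs its own load; accumulate, store back
    __m128d d = _mm_loadu_pd( dst );
    _mm_storeu_pd( dst, _mm_add_pd( d, s2 ) );
}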

Couple more notes.

Consider C++ for such code. For one, operators and overloaded functions help. Also, your whole hotspot can be written in a few lines with the Eigen library, with performance often comparable to manually written intrinsics.

Use compiler flags or GCC-specific function attributes to make sure all of these functions are always inlined.
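
For example, with GCC/Clang (just an illustration of the attribute; apply it to the helpers above):

// GCC/Clang-specific: inline even when the compiler's size/benefit
// heuristics would otherwise decline, or at low optimization levels.
#define ALWAYS_INLINE static inline __attribute__(( always_inline ))

ALWAYS_INLINE __m256d add_pd_demo( __m256d a, __m256d b )
{
    return _mm256_add_pd( a, b );
}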

  "Add the sum from #2 to *dst"

It’s never a good idea to store intermediate values for code like that. If you have 6 things to multiply/add on input, compute them all, and only store the result once. Memory access is inefficient in general, read-after-write especially so.

  "such that gcc/clang will reduce to FMA if available"

Reducing things to FMA is usually a good idea on Intel, not necessarily on AMD. AMD chips have twice the throughput for a 50% additions / 50% multiplications instruction mix, compared to 100% FMA. Unlike Intel, on AMD floating-point additions and multiplications don't compete for the same execution units. By "throughput" I mean instructions/second; marketing people decided that 1 FMA operation = 2 floating-point operations, so the FLOPS number is the same.

Soonts
  • Instead of 4x `vmovddup` broadcast loads and 2 shuffles, I was thinking 2x `vbroadcastf128` loads to feed 2x `vshufpd` or `vpermilpd` in-lane shuffles, if we still want results like `_mm256_set_pd(*b1, *b1, *b0, *b0);` instead of taking a different strategy altogether. (Just need something that can do different shuffles for the high and low lanes). – Peter Cordes Jan 27 '22 at 18:47
  • Hi @Soonts, what did you mean by "It's never a good idea to store intermediate values"? *dst is the final result, not an intermediate: the sum of the products needs to be added to *dst as the final result. – KJ7LNW Jan 28 '22 at 01:18
  • @KJ7LNW I was under the impression that function only implements part of the algorithm. If it computes the complete thing with all 6 inputs, please ignore that part. Also, I think I've spotted a bug in mul_CxC_2 with FMA enabled; I've updated that function. – Soonts Jan 28 '22 at 07:57
  • @Soonts, ya the `*dst += ...` line is the whole calculation, millions of times! I'll check out your CxC change, thanks! – KJ7LNW Jan 29 '22 at 06:23
  • @Soonts, in your mul_CxC_2 function, do `a` and `b` begin in the interleaved format A0r A0i A1r A1i, just like the compiler's `complex double` data type? I'm trying to understand the difference between your `mul_CxC_2` and @PeterCordes' `cmul_manualvec` and `cmul_matt` functions in the Godbolt link in this post: https://stackoverflow.com/a/39521257/14055985 (the godbolt link itself is too big for the comment) – KJ7LNW Jan 29 '22 at 06:42
  • @KJ7LNW Indeed, they are interleaved. The cmul_manualvec code in that link transposes into real and imaginary SIMD vectors, does vertical-only math on those vectors, then transposes back before storing. The sample code in my answer doesn't do that; it keeps them interleaved but needs shuffles and [fm]addsub_pd to multiply complex*complex. I think it's generally impossible to say which approach is better; it depends on both the use case and the CPU model. – Soonts Jan 30 '22 at 04:32
  • About performance: addsub/fmaddsub is free, i.e. the same cost as a normal addition/FMA. The shuffles are not free, but they shuffle within 128-bit halves, so they are rather cheap instructions. – Soonts Jan 30 '22 at 04:45