One big hotspot in EM simulation within the xnec2c project takes this form, and the same form is repeated throughout the calculation in various ways:
*dst += (*a1) * (*a2) + (*b1) * (*b2) + (*c1) * (*c2); // + (*d1) * (*d2)
(This Github link is a real example from matrix.c):
Typically a1, b1, c1 are complex doubles. Sometimes a2, b2, c2 are complex, other times they are real doubles. (I commented d1,d2 because while they don't exist in the EM simulation, I'm guessing we get them for free in AVX calculations so the code below below writes for 4 doubles).
Below is a modified version of Peter Cordes's answer to:
- Multiply 4 complex numbers against 4 doubles as above
- Sum result.
- Add the sum from #2 to
*dst
static inline void avx_fma_4zd4d_sum(double complex * restrict dst,
const double complex * restrict A,
double *b0, double *b1, double *b2, double *b3)
{
// low element first, little-endian style
__m256d A0 = _mm256_loadu_pd((double *) A); // [A0r A0i A1r A1i ] // [a b c d ]
__m256d A2 = _mm256_loadu_pd((double *) (A + 2)); // [e f g h ]
// Q: b0-b3 are spread out in memory. Is this the best way to load them?
__m256d B0 = _mm256_set_pd(*b1, *b1, *b0, *b0); // reverse from realA order bacause set_pd is.
__m256d B2 = _mm256_set_pd(*b3, *b3, *b2, *b2);
// desired: real=rArB, imag=rBiA
A0 = _mm256_mul_pd(A0, B0);
A2 = _mm256_mul_pd(A2, B2);
// sum A0-A3
A0 = _mm256_add_pd(A0, A2);
// Q: Now A0 has 2x complex doubles in a single vector. How do I sum them?
// Q: Once I do, how do I add the result to dst?
//_mm256_storeu_pd((double *) dst, A0);
//_mm256_storeu_pd((double *) (dst + 2), A2);
}
You can see my questions in the comments above. I was looking at this answer but it sums a vector row of all doubles, mine are complex doubles so it looks like [A0r, A0i, A1r, A1i] and the sums need to interleave. The result would be this if I jumped out of intrinsics, but I'd like to understand the intrinsics detail for this operation:
dst += (A0[0]+A0[2]) + (A0[1]+A0[3]) * I;
Note that *dst
is not in memory near *A
so it will need to be a separate load.
I don't have FMA hardware, but I would like it to compile such that gcc/clang will reduce to FMA if available.
Related Questions:
- What if
a2, b2, c2, d2
are complex? (named b0-b3 in the vector function)- Is summing to
dst
easier when they all complex values?
- Is summing to
- I can typically put
a1, b1, c1
in the same vector, buta2, b2, c2
are all over memory. Is there a better way to load them than using_mm256_set_pd
?
Thanks for your help!