
My input is 2 complex float vectors. Neither vector is interleaved; each is stored as separate real and imaginary arrays:

VecAReal = Are0, Are1, Are2,...Are[N-1]
VecAImag = Aim0, Aim1, Aim2,...Aim[N-1]

VecBReal = Bre0, Bre1, Bre2,...Bre[N-1]
VecBImag = Bim0, Bim1, Bim2,...Bim[N-1]

I have to compute their element-wise complex product:

VecCReal = Cre0, Cre1, Cre2,...Cre[N-1]
VecCImag = Cim0, Cim1, Cim2,...Cim[N-1]


Cre[i] = Are[i]*Bre[i] - Aim[i]*Bim[i]
Cim[i] = Are[i]*Bim[i] + Aim[i]*Bre[i]

In each iteration, I have to run _mm_load_ps on 4 different pointers.

After the multiplies and add/subtract, in each iteration I have to run _mm_store_ps on 2 different pointers.
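
Roughly, each iteration looks like this (a stripped-down sketch; I assume 16-byte-aligned data since I use _mm_load_ps, and N a multiple of 4):

    #include <immintrin.h>

    for (int i = 0; i < N; i += 4) {
        __m128 are = _mm_load_ps(&VecAReal[i]);   // 4 loads from 4 different pointers
        __m128 aim = _mm_load_ps(&VecAImag[i]);
        __m128 bre = _mm_load_ps(&VecBReal[i]);
        __m128 bim = _mm_load_ps(&VecBImag[i]);

        __m128 cre = _mm_sub_ps(_mm_mul_ps(are, bre), _mm_mul_ps(aim, bim));
        __m128 cim = _mm_add_ps(_mm_mul_ps(are, bim), _mm_mul_ps(aim, bre));

        _mm_store_ps(&VecCReal[i], cre);          // 2 stores to 2 different pointers
        _mm_store_ps(&VecCImag[i], cim);
    }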

It seems very inefficient because of all the loads/stores. Can you suggest a better way?


1 Answer


It's unavoidable that you have to load all the data and store it, unless you can do more work in one pass over your data. e.g. whatever you're going to do next with each Cre/Cim element, do that with vectors of results from your multiply while you still have them in __m128 local vars.
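
For example (a made-up follow-up step, just to illustrate): if the next pass over C were a simple scaling by a real factor, fold it into this loop before the stores, while cre/cim are still live in registers, instead of re-loading C in a second loop:

    __m128 k = _mm_set1_ps(scale);                 // 'scale' is a hypothetical factor from a later pass
    _mm_store_ps(&VecCReal[i], _mm_mul_ps(cre, k));
    _mm_store_ps(&VecCImag[i], _mm_mul_ps(cim, k));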

In terms of computational intensity¹, right now you're doing 4 multiplies and two add/subs per 4 loads and 2 stores. Or two multiplies and two FMAs. So that's not great, but this data layout is optimal.
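
For example (a sketch, assuming FMA3 is available, e.g. compiling with -mfma, and using the same register names as the sketch in the question), the arithmetic part would be:

    __m128 cre = _mm_fnmadd_ps(aim, bim, _mm_mul_ps(are, bre)); // Are*Bre - Aim*Bim
    __m128 cim = _mm_fmadd_ps(aim, bre, _mm_mul_ps(are, bim));  // Are*Bim + Aim*Bre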

If you used a less convenient layout (like interleaved real and imaginary parts), you'd have more ALU work per vector of results, but it wouldn't be useful work; it'd be shuffling overhead created by an inconvenient storage format.
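
For reference, a sketch of what one vector of interleaved complex multiplies looks like with SSE3 (the interleavedA/B/C arrays are hypothetical; note the extra movedup/shuffle work per addsub):

    __m128 a = _mm_load_ps(&interleavedA[2*i]);   // [Are0, Aim0, Are1, Aim1]
    __m128 b = _mm_load_ps(&interleavedB[2*i]);
    __m128 are   = _mm_moveldup_ps(a);            // [Are0, Are0, Are1, Are1]
    __m128 aim   = _mm_movehdup_ps(a);            // [Aim0, Aim0, Aim1, Aim1]
    __m128 bswap = _mm_shuffle_ps(b, b, _MM_SHUFFLE(2, 3, 0, 1)); // [Bim0, Bre0, Bim1, Bre1]
    __m128 c = _mm_addsub_ps(_mm_mul_ps(are, b),      // even lanes: Are*Bre - Aim*Bim
                             _mm_mul_ps(aim, bswap)); // odd lanes:  Are*Bim + Aim*Bre
    _mm_store_ps(&interleavedC[2*i], c);          // [Cre0, Cim0, Cre1, Cim1]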

And one __m128 vector of interleaved data would only hold two complex numbers, so you'd be getting the same number of complex numbers per load either way: 4 loads for 8 complex numbers (4 each from A and B) with the split layout, vs. 2 loads for 4 complex numbers (2 each from A and B) interleaved. So perhaps the key observation is how many results you're computing per load/store, not just loads/stores per loop iteration.


If you can't fold an earlier or later pass into this work, you could try to cache-block so you hit in L1d or L2 cache, e.g. doing multiple steps of your calculation on a range of i values before moving to the next range. See How does cache blocking actually speed up performance?
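
A hedged sketch of that shape (complex_multiply_block and next_pass_block are hypothetical helpers for this multiply and for whatever you do with C next):

    // Process a block small enough to stay hot in L1d/L2, then run the next
    // step on it before moving on, instead of a second full pass over the arrays.
    const size_t BLOCK = 4096;                  // floats per array per block: tune to cache size
    for (size_t start = 0; start < N; start += BLOCK) {
        size_t end = (start + BLOCK < N) ? start + BLOCK : N;
        complex_multiply_block(start, end);     // the multiply from this question
        next_pass_block(start, end);            // hypothetical: whatever you do with C next
    }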


Footnote 1: Computational intensity is the amount of ALU work per time you load the data into registers. Or per time you bring it into L1d cache, if you're measuring the effectiveness of cache blocking, even if you end up loading / storing it multiple times within a small region of your data as you work on it.

Doing just one thing in one pass over a big array is not good for performance, even if it's a complex multiply instead of a simple dot product. Modern CPUs have way more FP math throughput than DRAM or L3 bandwidth.

But Intel since Haswell can do 2 loads and 1 store per clock, or Ice Lake 2 loads + 2 stores per clock. And Alder Lake 3 loads + 2 stores per clock. Assuming they all hit in L1d cache. This is for any scalar or vector up to the max vector width supported, as long as it doesn't cross a cache-line boundary.

AMD since Zen 3 can do 3 loads and 2 stores per clock as well, but only 2 loads + 2 stores for vectors. It also has 2x 256-bit FMA per clock (and unlike Intel, can also do 2x 256-bit FP add at the same time as the FMA/mul operations.) https://uops.info/

So the latest generations of CPUs have quite fast access to L1d to keep their vector ALUs fed, but L2 cache can't even keep up with 1 cache line per clock cycle. And even when hitting in L1d, something like a dot product still bottlenecks on load bandwidth rather than FMA throughput if properly optimized (unrolled with multiple accumulators to hide FMA latency), especially for real numbers rather than complex.
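
For example, a properly optimized real dot product looks something like this sketch (assuming FMA3, 16-byte-aligned a and b, and N a multiple of 16): every FMA needs two loads, so L1d load throughput, not FMA throughput, is the limit.

    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    for (size_t i = 0; i < N; i += 16) {        // 4 independent accumulators hide FMA latency
        acc0 = _mm_fmadd_ps(_mm_load_ps(a + i),      _mm_load_ps(b + i),      acc0);
        acc1 = _mm_fmadd_ps(_mm_load_ps(a + i + 4),  _mm_load_ps(b + i + 4),  acc1);
        acc2 = _mm_fmadd_ps(_mm_load_ps(a + i + 8),  _mm_load_ps(b + i + 8),  acc2);
        acc3 = _mm_fmadd_ps(_mm_load_ps(a + i + 12), _mm_load_ps(b + i + 12), acc3);
    }
    __m128 acc = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));          // horizontal sum of the 4 lanes
    acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 1));
    float dot = _mm_cvtss_f32(acc);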


Cache locality and number of streams

Perhaps you're worried about how many different pointers are involved?

Modern x86 CPUs generally have L1d caches that are at least 8-way associative, so even in the worst case, where all the arrays are at the same offset relative to a 4k boundary (which is the page size and the usual L1d aliasing stride), the hot lines from all 6 streams can still fit in one set without evicting each other. Skylake-client L2 is only 4-way associative, though, so unfortunate aliasing could lead to some extra conflict misses.

4 read and 2 write streams in one loop is generally fine, including for L2 prefetchers. (For example, Intel's L2 prefetchers support 1 forward and 1 backward stream per 4k page, IIRC. So perhaps there could be a downside if the arrays are so small that the same index in multiple arrays is within the same 4k page.)

See For-loop efficiency: merging loops for some details about multiple load/store streams. (Loads and stores compete for the same LFBs (line fill buffers); Skylake for example has 12, so it can track 12 outstanding cache lines in flight to or from L2 and farther away. The "superqueue" tracking requests from L2 going off-core has a few more entries.)

If you were worried about spatial locality, you could group your data with 16 floats of reals and 16 floats of imaginary parts alternating, e.g. struct { float real[16], imag[16]; };, in case you want to use __m512 vectors in the future.

Or, if you could easily change the layout again later, use groups of 8 floats for now, so 8 complex numbers fit in one 64-byte cache line (if you align the structs), especially if you ever do any random access. And you waste less space if the number of elements isn't a multiple of 16.
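
A sketch of that layout idea (the struct name is made up; alignas(64) makes each chunk exactly one aligned cache line):

    #include <stdalign.h>

    struct ReImChunk8 {                 // one 64-byte cache line = 8 complex numbers
        alignas(64) float real[8];
        float imag[8];
    };
    // Element i lives in chunk[i / 8].real[i % 8] and chunk[i / 8].imag[i % 8];
    // widening to real[16]/imag[16] later would suit __m512 vectors.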


If you can overwrite one of A or B with the result instead of using a separate C, that will save some of the memory bandwidth and reduce the footprint in number of TLB pages, too. (A store to cold memory will result in a load (RFO = Read For Ownership) as well as an eventual eviction of dirty data.)

It would also mean one fewer pointer to keep track of. (Not a big problem for x86-64; it has 15 general-purpose integer registers, separate from its vector registers, so that's plenty for the integer loop-control stuff.)
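
A sketch of the in-place version (overwriting A with the product), using the same placeholder names as the question:

    __m128 are = _mm_load_ps(&VecAReal[i]);
    __m128 aim = _mm_load_ps(&VecAImag[i]);
    __m128 bre = _mm_load_ps(&VecBReal[i]);
    __m128 bim = _mm_load_ps(&VecBImag[i]);
    // Safe to overwrite: A's values are already in registers before the stores.
    _mm_store_ps(&VecAReal[i], _mm_sub_ps(_mm_mul_ps(are, bre), _mm_mul_ps(aim, bim)));
    _mm_store_ps(&VecAImag[i], _mm_add_ps(_mm_mul_ps(are, bim), _mm_mul_ps(aim, bre)));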
