Background facts: mul+add can be done as one FMA (Fused Multiply-Add), e.g. `_mm512_fmadd_ps`.

MSVC defaults to `#pragma STDC FP_CONTRACT OFF`, not contracting `a*b + c` into `fma(a,b,c)`. And it doesn't recognize pragmas for it; instead you'd need command-line options like `-O2 -arch:AVX512 /fp:contract`. Or `-arch:AVX2 /fp:contract`; `-arch:AVX2` also enables 256-bit FMA, despite FMA3 and AVX2 being separate extensions.
Also, `/fp:fast` allows other FP optimizations, like treating FP math as associative, so it could auto-vectorize a dot product. For example, on Godbolt it unrolls with two vector accumulators to hide some FP latency, unlike your manually-vectorized loop, which will bottleneck at one vector per 4 clocks (`vaddps` latency) unless cache/memory bandwidth is a worse bottleneck.
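For concreteness, this is the kind of plain scalar loop I'm assuming (a sketch, not your actual code): vectorizing it, or unrolling it with multiple accumulators, changes the order of the additions into `sum`, which is why MSVC needs `/fp:fast` to do it.

```cpp
#include <cstddef>

// Plain scalar dot product: without /fp:fast, MSVC keeps one serial chain
// of adds in source order, so it can't vectorize or unroll the sum.
float dot_scalar(const float *a, const float *b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];   // one mul + one add (or one FMA) per element
    return sum;
}
```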
But you probably compiled without those options, so the scalar C++ loop compiled to scalar asm using `vmulss` and `vaddss`: a scalar asm equivalent of how your manually-vectorized loop compiled. (MSVC normally doesn't optimize intrinsics, but it actually will contract `_mm256_add_ps(_mm256_mul_ps(a,b), c)` into `_mm256_fmadd_ps(a,b,c)`. But I'm pretty sure it won't unroll a manually-vectorized loop with multiple `__m512` or `__m256` vector accumulators, even with `/fp:fast`.)
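To illustrate the contraction point with intrinsics (a minimal sketch; the function names are mine):

```cpp
#include <immintrin.h>

// With /fp:contract (or /fp:fast) and -arch:AVX2, MSVC may fuse this
// mul+add pair into a single vfmadd...ps.
__m256 mul_add(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

// Writing the FMA explicitly doesn't depend on contraction settings,
// only on targeting a CPU with FMA3 (e.g. -arch:AVX2 or -arch:AVX512).
__m256 fma_explicit(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);
}
```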
Rocket Lake can do 2x 256-bit FP math operations per clock, or 1x 512-bit. It can do 2x 512-bit loads per clock cycle if they're aligned and hit in L1d cache (32KiB per core). FP ADD/MUL/FMA each have 4 cycle latency (https://uops.info/ and https://agner.org/optimize/)
The main bottlenecks for a dot product are:
**Cache or DRAM bandwidth:** 2 loads per clock if hot in L1d cache, less if data is coming from L2 or farther away. See also Intel's optimization manual re: sustained bandwidth from various levels of cache. You can use perf counters to see misses, with Linux `perf stat` or the equivalent on other OSes, perhaps VTune.
Also somewhat less if some of the loads are misaligned and cross cache-line boundaries (or worse, page boundaries); e.g. a split load has to access L1d cache twice, so it basically costs double. Alignment-not-required loads like `_mm512_loadu_ps` have no extra cost if the data is actually aligned at run-time.
But if data isn't already hot in L1d, there will be some idle L1d cycles anyway as L2 can't deliver 2 lines per cycle, in fact can't sustain 1 per cycle. So misaligned arrays only slow down a loop by maybe 2% for 256-bit vectors when data's coming from DRAM or maybe L3.
But for 512-bit vectors, where every misaligned load must split across cache lines, it's about 15% lower DRAM bandwidth on Skylake-AVX512, probably from running out of line-split buffers and not being able to keep as many loads in flight to max out the memory pipeline. Rocket Lake might be different, being newer and having a lower-latency uncore since it's a "client" chip, not a "server" chip.
**FP latency** of 4 cycles from starting a `vaddps` or `vfma...ps` until the result is ready for the next operation. This forms a dependency chain which out-of-order exec can't hide, so you're limited to one vector (or scalar) per 4 clocks, unless you use multiple independent `__m512 sum0, sum1, sum2, sum3` variables and add them together at the end. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for more details and benchmark results with 256-bit vectors on Skylake, where 8 or even 12 vector accumulators helped get close to CPU throughput limits.
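A sketch of what that looks like (my own code, not yours), assuming 64-byte-aligned data and `n` a multiple of 64 floats so there's no cleanup to worry about here:

```cpp
#include <immintrin.h>
#include <cstddef>

float dot_avx512(const float *a, const float *b, std::size_t n) {
    // 4 independent accumulators = 4 FMA dependency chains in flight.
    // 8 to 12 would be even better when data is hot in L1d cache.
    __m512 sum0 = _mm512_setzero_ps(), sum1 = _mm512_setzero_ps();
    __m512 sum2 = _mm512_setzero_ps(), sum3 = _mm512_setzero_ps();
    for (std::size_t i = 0; i < n; i += 64) {
        sum0 = _mm512_fmadd_ps(_mm512_load_ps(a + i),      _mm512_load_ps(b + i),      sum0);
        sum1 = _mm512_fmadd_ps(_mm512_load_ps(a + i + 16), _mm512_load_ps(b + i + 16), sum1);
        sum2 = _mm512_fmadd_ps(_mm512_load_ps(a + i + 32), _mm512_load_ps(b + i + 32), sum2);
        sum3 = _mm512_fmadd_ps(_mm512_load_ps(a + i + 48), _mm512_load_ps(b + i + 48), sum3);
    }
    __m512 sum = _mm512_add_ps(_mm512_add_ps(sum0, sum1), _mm512_add_ps(sum2, sum3));
    return _mm512_reduce_add_ps(sum);  // horizontal sum (helper intrinsic in current compilers)
}
```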
With multiple short dot-products, like at the left side of your graphs, out-of-order execution may be overlapping work across iterations of the outer repeat loop, finding instruction-level parallelism between FP adds which doesn't exist within one longer dot product. This may be part of how you're getting speedups greater than 16x. Using FMAs instead of separate mul+add would increase the amount of benefit possible, both from increased peak FLOPs and from having fewer uops for the same amount of work making it easier for out-of-order exec to see farther.
If your dot products were better optimized, there wouldn't be so much left to gain, but your naive scalar version bottlenecks on one add per 4 cycles, as well as only doing one element per add instruction.
**FMA or MUL/ADD throughput:** well actually, each FMA (or separate MUL/ADD pair) requires 2 loads, so a well-optimized dot product will bottleneck on 2 loads per clock and thus one FMA per clock.
For Rocket Lake with 512-bit vectors, 1 FMA per clock is actually the max, unlike with 256-bit or narrower where the FP ALUs on port 0 and 1 work separately to start 2x MUL/ADD/SUB/FMA per clock cycle. (With 4 cycle latency, fully pipelined, so up to 8 operations in flight). So once you unroll with multiple accumulators, there is actually throughput to be gained from replacing MUL+ADD with FMA, at least when data is hot in L1d cache.
On a chip like Cascade Lake or Ice Lake-Xeon with a 2nd 512-bit FMA unit, or with scalar, 128-bit, or 256-bit vectors, in theory there'd be little to gain, only a reduced number of uops to keep up with the loads, since both FP ports could be maxed out running `vmulps zmm` and `vaddps zmm`. But that would make more heat, which could limit turbo clocks.
On later chips like Alder Lake / Sapphire Rapids which can do 3 loads per clock but still only 2 FP math ops per clock, those 3 loads can feed an average of 1.5 FMAs per clock, and using separate mul+add would be a real bottleneck even with perfect scheduling of uops. Of course assuming you unrolled enough to hide FP latency.
> Array `a512IRCurr` is not modified for life of the run, 5) Array `a512In` is a circular queue with only 1 float written per run.
(5) If you're only writing one float per run, (a) isn't it inconvenient to have a `vector<__m512>` instead of a `vector<float, custom_aligned_allocator>`?
(Also, @chtz suggests that if you need multiple dot-products over this data, you should use an FFT to do the convolution.)
> I have 16 copies of `a512IRCurr`, each one shifted by one float, on the (apparently mistaken!?) impression that these accesses HAD to be aligned.
Indeed that's terrible, except when your data is so small that all the copies fit in L1d cache; then you get an easy win from aligned loads.

Loading misaligned data is worth the cost when your arrays are too large to fit in L1d cache, as your graph shows. Or perhaps do aligned loads but then grab the unaligned window of data you want using `valignd`: with one FMA per clock, port 5 is free to run a different 512-bit uop every cycle, such as a shuffle. Except `valignd` only works with an immediate shift count, making it painful; you'd need 16 versions of the loop. Instead, you might use `vpermt2ps` (`_mm512_permutex2var_ps`) with a shuffle-control vector to select the data you want. (The control vector itself could be loaded with one unaligned load from an array of `int` values 0..31, perhaps with `vpmovzxbd` so it's only a 16-byte load and won't cross a cache-line boundary.) `vpermt2ps zmm` is single-uop on Intel and on Zen 4 (https://uops.info/).
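A sketch of that idea (the identifiers are mine; assumes `base` is 64-byte aligned and at least 32 floats are readable starting at `base + block*16`):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Byte indices 0..31; a 16-byte chunk starting at shift (0..15) never
// crosses a cache-line boundary thanks to alignas(64).
alignas(64) static const std::int8_t window_idx[32] = {
     0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
    16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31};

// Return the 16 contiguous floats starting at base[block*16 + shift],
// using only aligned 64-byte data loads plus one vpermt2ps.
__m512 load_window(const float *base, std::size_t block, unsigned shift /* 0..15 */) {
    __m512  lo   = _mm512_load_ps(base + block * 16);       // aligned
    __m512  hi   = _mm512_load_ps(base + block * 16 + 16);  // next aligned vector
    __m128i idx8 = _mm_loadu_si128((const __m128i *)(window_idx + shift));
    __m512i idx  = _mm512_cvtepu8_epi32(idx8);               // vpmovzxbd: widen to dword indices
    return _mm512_permutex2var_ps(lo, idx, hi);              // indices 0..15 pick lo, 16..31 pick hi
}
```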
> I declared the vector to be of `__m512` then made `float* af = (float*) &...` to allow easy access to individual elements.
`float* af = (float*) &...` is strict-aliasing UB, unless you use a GNU C extension like `typedef float aliasing_float __attribute__((may_alias))`. MSVC always works like gcc/clang `-fno-strict-aliasing`, so your code is actually safe there, but not portable.

See GCC AVX __m256i cast to int array leads to wrong values for a real-life case of breakage. It might actually be safe to point `float*` at `__m512` objects because GNU C defines `__m512` as a vector of float elements, but I wouldn't count on it being fully safe. I'd still go with an aligned vector or array of `float`, even though it's somewhat of a pain to get C++ to do over-aligned allocations for a `std::vector`: you need a custom allocator as the 2nd template parameter, and the standard library doesn't come with one.
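A minimal over-aligning allocator could look something like this (my own sketch using C++17 aligned `operator new`; not from any library):

```cpp
#include <cstddef>
#include <new>
#include <vector>

template <class T, std::size_t Align>
struct aligned_allocator {
    static_assert(Align >= alignof(T), "Align must be at least alignof(T)");
    using value_type = T;

    aligned_allocator() noexcept = default;
    template <class U>
    aligned_allocator(const aligned_allocator<U, Align> &) noexcept {}

    T *allocate(std::size_t n) {
        return static_cast<T *>(::operator new(n * sizeof(T), std::align_val_t{Align}));
    }
    void deallocate(T *p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Align});
    }
    template <class U> struct rebind { using other = aligned_allocator<U, Align>; };
};
template <class T, class U, std::size_t A>
bool operator==(const aligned_allocator<T, A> &, const aligned_allocator<U, A> &) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const aligned_allocator<T, A> &, const aligned_allocator<U, A> &) noexcept { return false; }

// 64-byte aligned storage, so _mm512_load_ps(&v[i]) is safe whenever i % 16 == 0.
using avec = std::vector<float, aligned_allocator<float, 64>>;
```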
> I updated the graph in my answer adding a hand-unrolled loop (just 8 multiplies and adds in a row). I'm surprised at how fast it is, and astonished it's faster than AVX for up to 100 floats. I'm equally astonished that aligned AVX exceeds 16x faster; how the heck is that possible?
Hard to say without seeing the asm, or at least the C++ for it and the compilation options. If you used multiple `sum` variables (accumulators), perhaps it even got auto-vectorized by the compiler using 256-bit vectors?
If not, 1 scalar FMA (or mul/add) per clock (2 loads / clock, half of max FMA throughput) is half the speed of 1 vector of 8 FMAs per 4 clocks (FMA latency), or 1/4 the speed of 1 vector of 16 FMAs per 4 clocks. But perhaps the compiler does a better job with cleanup than you did for a horizontal sum? You didn't show that part.
Of course, to run fast for small-to-medium arrays (many vectors but still fitting in L1d), you want 8 to 12 `__m256` or `__m512` vector accumulators. But that means the size threshold for entering the vectorized loop is higher, and leaves more possible cleanup work.
If small sizes matter, having a 1-vector cleanup loop as well as a scalar cleanup loop is good.
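For example, the cleanup after an unrolled main loop could be structured like this (a sketch with my own names, using unaligned loads so it works for any tail length):

```cpp
#include <immintrin.h>
#include <cstddef>

float dot_any_size(const float *a, const float *b, std::size_t n) {
    std::size_t i = 0;
    __m512 vsum = _mm512_setzero_ps();

    // ... unrolled main loop over blocks of 64 floats with 4+ accumulators,
    //     combined into vsum, would go here (as in the earlier sketch) ...

    for (; i + 16 <= n; i += 16)   // 1-vector cleanup loop
        vsum = _mm512_fmadd_ps(_mm512_loadu_ps(a + i), _mm512_loadu_ps(b + i), vsum);

    float sum = _mm512_reduce_add_ps(vsum);
    for (; i < n; ++i)             // scalar cleanup for the last 0..15 floats
        sum += a[i] * b[i];
    return sum;
}
```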