Background facts: mul+add can be done as one FMA (Fused Multiply-Add), e.g. `_mm512_fmadd_ps`.

MSVC defaults to `#pragma STDC FP_CONTRACT OFF`, not contracting `a*b + c` into `fma(a,b,c)`. And it doesn't recognize pragmas for it; instead you'd need command-line options like `-O2 -arch:AVX512 /fp:contract`. Or `-arch:AVX2 /fp:contract`; `-arch:AVX2` also enables 256-bit FMA, despite FMA3 and AVX2 being separate extensions.
Also, `/fp:fast` allows other FP optimizations, like treating FP math as associative, so it could auto-vectorize a dot product. For example, on Godbolt it unrolls with two vector accumulators to hide some FP latency, unlike your manually-vectorized loop, which will bottleneck at one vector per 4 clocks (`vaddps` latency) unless cache/memory bandwidth is a worse bottleneck.
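For concreteness, this is the kind of plain scalar loop I'm assuming (a sketch, not your actual code): vectorizing it, or unrolling it with multiple accumulators, changes the order of the additions into `sum`, which is why MSVC needs `/fp:fast` to do it.

```cpp
#include <cstddef>

// Plain scalar dot product: without /fp:fast, MSVC keeps one serial chain
// of adds in source order, so it can't vectorize or unroll the sum.
float dot_scalar(const float *a, const float *b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];   // one mul + one add (or one FMA) per element
    return sum;
}
```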
But you probably compiled without those options, so the scalar C++ loop compiled to scalar asm using `vmulss` and `vaddss`: a scalar asm equivalent of how your manually-vectorized loop compiled. (MSVC normally doesn't optimize intrinsics, but it actually will contract `_mm256_add_ps(_mm256_mul_ps(a,b), c)` into `_mm256_fmadd_ps(a,b,c)`. But I'm pretty sure it won't unroll a manually-vectorized loop with multiple `__m512` or `__m256` vector accumulators, even with `/fp:fast`.)
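To illustrate the contraction point with intrinsics (a minimal sketch; the function names are mine):

```cpp
#include <immintrin.h>

// With /fp:contract (or /fp:fast) and -arch:AVX2, MSVC may fuse this
// mul+add pair into a single vfmadd...ps.
__m256 mul_add(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

// Writing the FMA explicitly doesn't depend on contraction settings,
// only on targeting a CPU with FMA3 (e.g. -arch:AVX2 or -arch:AVX512).
__m256 fma_explicit(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);
}
```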
Rocket Lake can do 2x 256-bit FP math operations per clock, or 1x 512-bit. It can do 2x 512-bit loads per clock cycle if they're aligned and hit in L1d cache (32KiB per core). FP ADD/MUL/FMA each have 4 cycle latency (https://uops.info/ and https://agner.org/optimize/)
The main bottlenecks for a dot product are:
**Cache or DRAM bandwidth:** 2 loads per clock if hot in L1d cache, less if data is coming from L2 or farther away. See also Intel's optimization manual re: sustained bandwidth from various levels of cache. You can use perf counters to see misses, with Linux `perf stat` or the equivalent on other OSes, perhaps VTune.
Also somewhat less if some of the loads are misaligned and cross cache-line boundaries (or worse, page boundaries); e.g. a split load has to access L1d cache twice, so it basically costs double. Alignment-not-required loads like `_mm512_loadu_ps` have no extra cost if the data is actually aligned at run-time.
But if data isn't already hot in L1d, there will be some idle L1d cycles anyway as L2 can't deliver 2 lines per cycle, in fact can't sustain 1 per cycle. So misaligned arrays only slow down a loop by maybe 2% for 256-bit vectors when data's coming from DRAM or maybe L3.
But for 512-bit vectors, where every misaligned load must split across cache lines, it's about 15% lower DRAM bandwidth on Skylake-AVX512, probably from running out of line-split buffers and not being able to keep as many loads in flight to max out the memory pipeline. Rocket Lake might be different, being newer and having a lower-latency uncore since it's a "client" chip, not a "server" chip.
**FP latency** of 4 cycles from starting a `vaddps` or `vfma...ps` until the result is ready for the next operation. This forms a dependency chain which out-of-order exec can't hide, so you're limited to one vector (or scalar) per 4 clocks, unless you use multiple independent `__m512 sum0, sum1, sum2, sum3` variables and add them together at the end. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for more details and benchmark results with 256-bit vectors on Skylake, where 8 or even 12 vector accumulators helped get close to CPU throughput limits.
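A sketch of what that looks like (my own code, not yours), assuming 64-byte-aligned data and `n` a multiple of 64 floats so there's no cleanup to worry about here:

```cpp
#include <immintrin.h>
#include <cstddef>

float dot_avx512(const float *a, const float *b, std::size_t n) {
    // 4 independent accumulators = 4 FMA dependency chains in flight.
    // 8 to 12 would be even better when data is hot in L1d cache.
    __m512 sum0 = _mm512_setzero_ps(), sum1 = _mm512_setzero_ps();
    __m512 sum2 = _mm512_setzero_ps(), sum3 = _mm512_setzero_ps();
    for (std::size_t i = 0; i < n; i += 64) {
        sum0 = _mm512_fmadd_ps(_mm512_load_ps(a + i),      _mm512_load_ps(b + i),      sum0);
        sum1 = _mm512_fmadd_ps(_mm512_load_ps(a + i + 16), _mm512_load_ps(b + i + 16), sum1);
        sum2 = _mm512_fmadd_ps(_mm512_load_ps(a + i + 32), _mm512_load_ps(b + i + 32), sum2);
        sum3 = _mm512_fmadd_ps(_mm512_load_ps(a + i + 48), _mm512_load_ps(b + i + 48), sum3);
    }
    __m512 sum = _mm512_add_ps(_mm512_add_ps(sum0, sum1), _mm512_add_ps(sum2, sum3));
    return _mm512_reduce_add_ps(sum);  // horizontal sum (helper intrinsic in current compilers)
}
```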
With multiple short dot-products, like at the left side of your graphs, out-of-order execution may be overlapping work across iterations of the outer repeat loop, finding instruction-level parallelism between FP adds which doesn't exist within one longer dot product. This may be part of how you're getting speedups greater than 16x. Using FMAs instead of separate mul+add would increase the amount of benefit possible, both from increased peak FLOPs and from having fewer uops for the same amount of work making it easier for out-of-order exec to see farther.
If your dot products were better optimized, there wouldn't be so much left to gain, but your naive scalar version bottlenecks on one add per 4 cycles, as well as only doing one element per add instruction.
**FMA or MUL/ADD throughput:** well actually, each FMA (or separate MUL/ADD pair) requires 2 loads, so a well-optimized dot product will bottleneck on 2 loads per clock and thus one FMA per clock.
For Rocket Lake with 512-bit vectors, 1 FMA per clock is actually the max, unlike with 256-bit or narrower where the FP ALUs on port 0 and 1 work separately to start 2x MUL/ADD/SUB/FMA per clock cycle. (With 4 cycle latency, fully pipelined, so up to 8 operations in flight). So once you unroll with multiple accumulators, there is actually throughput to be gained from replacing MUL+ADD with FMA, at least when data is hot in L1d cache.
On a chip like Cascade Lake or Ice Lake-Xeon with a 2nd 512-bit FMA unit, or with scalar, 128-bit, or 256-bit vectors, in theory there'd be little to gain, only a reduced number of uops to keep up with the loads, since both FP ports could be maxed out running `vmulps zmm` and `vaddps zmm`. But that would make more heat, which could limit turbo clocks.
On later chips like Alder Lake / Sapphire Rapids which can do 3 loads per clock but still only 2 FP math ops per clock, those 3 loads can feed an average of 1.5 FMAs per clock, and using separate mul+add would be a real bottleneck even with perfect scheduling of uops. Of course assuming you unrolled enough to hide FP latency.
> Array `a512IRCurr` is not modified for life of the run, 5) Array `a512In` is a circular queue with only 1 float written per run.
(5) If you're only writing one float per run, (a) isn't it inconvenient to have a `vector<__m512>` instead of a `vector<float, custom_aligned_allocator>`?
(Also, @chtz suggests that if you need multiple dot-products over this data, you should use an FFT to do the convolution.)
> I have 16 copies of `a512IRCurr`, each one shifted by one float, on the (apparently mistaken!?) impression that these accesses HAD to be aligned.
Indeed that's terrible, except when your data is so small that all the copies fit in L1d cache; then you get an easy win from aligned loads.

Loading misaligned data is worth the cost when your arrays are too large to fit in L1d cache, as your graph shows. Or perhaps do aligned loads but then grab the unaligned window of data you want using `valignd`: with one FMA per clock, port 5 is free to run a different 512-bit uop every cycle, such as a shuffle. Except `valignd` only works with an immediate shift count, making it painful; you'd need 16 versions of the loop. Instead, you might use `vpermt2ps` (`_mm512_permutex2var_ps`) with a shuffle-control vector to select the data you want. (The control vector itself could be loaded with one unaligned load from an array of `int` values 0..31, perhaps with `vpmovzxbd` so it's only a 16-byte load and won't cross a cache-line boundary.) `vpermt2ps zmm` is single-uop on Intel and on Zen 4 (https://uops.info/).
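A sketch of that idea (the identifiers are mine; assumes `base` is 64-byte aligned and at least 32 floats are readable starting at `base + block*16`):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Byte indices 0..31; a 16-byte chunk starting at shift (0..15) never
// crosses a cache-line boundary thanks to alignas(64).
alignas(64) static const std::int8_t window_idx[32] = {
     0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
    16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31};

// Return the 16 contiguous floats starting at base[block*16 + shift],
// using only aligned 64-byte data loads plus one vpermt2ps.
__m512 load_window(const float *base, std::size_t block, unsigned shift /* 0..15 */) {
    __m512  lo   = _mm512_load_ps(base + block * 16);       // aligned
    __m512  hi   = _mm512_load_ps(base + block * 16 + 16);  // next aligned vector
    __m128i idx8 = _mm_loadu_si128((const __m128i *)(window_idx + shift));
    __m512i idx  = _mm512_cvtepu8_epi32(idx8);               // vpmovzxbd: widen to dword indices
    return _mm512_permutex2var_ps(lo, idx, hi);              // indices 0..15 pick lo, 16..31 pick hi
}
```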
> I declared the vector to be of `__m512` then made `float* af = (float*) &...` to allow easy access to individual elements.
`float* af = (float*) &...` is strict-aliasing UB, unless you use a GNU C extension like `typedef float aliasing_float __attribute__((may_alias))`. MSVC always works like gcc/clang `-fno-strict-aliasing`, so your code is actually safe there, but not portable.

See GCC AVX __m256i cast to int array leads to wrong values for a real-life case of breakage. It might actually be safe to point `float*` at `__m512` objects because GNU C defines `__m512` as a vector of float elements, but I wouldn't count on it being fully safe. I'd still go with an aligned vector or array of `float`, even though it's somewhat of a pain to get C++ to do over-aligned allocations for a `std::vector`: you need a custom allocator as the 2nd template parameter, and the standard library doesn't come with one.
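A minimal over-aligning allocator could look something like this (my own sketch using C++17 aligned `operator new`; not from any library):

```cpp
#include <cstddef>
#include <new>
#include <vector>

template <class T, std::size_t Align>
struct aligned_allocator {
    static_assert(Align >= alignof(T), "Align must be at least alignof(T)");
    using value_type = T;

    aligned_allocator() noexcept = default;
    template <class U>
    aligned_allocator(const aligned_allocator<U, Align> &) noexcept {}

    T *allocate(std::size_t n) {
        return static_cast<T *>(::operator new(n * sizeof(T), std::align_val_t{Align}));
    }
    void deallocate(T *p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Align});
    }
    template <class U> struct rebind { using other = aligned_allocator<U, Align>; };
};
template <class T, class U, std::size_t A>
bool operator==(const aligned_allocator<T, A> &, const aligned_allocator<U, A> &) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const aligned_allocator<T, A> &, const aligned_allocator<U, A> &) noexcept { return false; }

// 64-byte aligned storage, so _mm512_load_ps(&v[i]) is safe whenever i % 16 == 0.
using avec = std::vector<float, aligned_allocator<float, 64>>;
```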
> I updated the graph in my answer adding a hand-unrolled loop (just 8 multiplies and adds in a row). I'm surprised at how fast it is, and astonished it's faster than AVX for up to 100 floats. I'm equally astonished that aligned AVX exceeds 16x faster; how the heck is that possible?
Hard to say without seeing the asm, or at least the C++ for it and the compilation options. If you used multiple `sum` variables (accumulators), perhaps it even got auto-vectorized by the compiler using 256-bit vectors?
If not, 1 scalar FMA (or mul/add) per clock (2 loads / clock, half of max FMA throughput) is half the speed of 1 vector of 8 FMAs per 4 clocks (FMA latency), or 1/4 the speed of 1 vector of 16 FMAs per 4 clocks. But perhaps the compiler does a better job with cleanup than you did for a horizontal sum? You didn't show that part.
Of course, to run fast for small-to-medium arrays (many vectors but still fitting in L1d), you want 8 to 12 `__m256` or `__m512` vector accumulators. But that means the size threshold for entering the vectorized loop is higher, and leaves more possible cleanup work.
If small sizes matter, having a 1-vector cleanup loop as well as a scalar cleanup loop is good.
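For example, the cleanup after an unrolled main loop could be structured like this (a sketch with my own names, using unaligned loads so it works for any tail length):

```cpp
#include <immintrin.h>
#include <cstddef>

float dot_any_size(const float *a, const float *b, std::size_t n) {
    std::size_t i = 0;
    __m512 vsum = _mm512_setzero_ps();

    // ... unrolled main loop over blocks of 64 floats with 4+ accumulators,
    //     combined into vsum, would go here (as in the earlier sketch) ...

    for (; i + 16 <= n; i += 16)   // 1-vector cleanup loop
        vsum = _mm512_fmadd_ps(_mm512_loadu_ps(a + i), _mm512_loadu_ps(b + i), vsum);

    float sum = _mm512_reduce_add_ps(vsum);
    for (; i < n; ++i)             // scalar cleanup for the last 0..15 floats
        sum += a[i] * b[i];
    return sum;
}
```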