For the reduce-add, just do in-lane shuffles and adds (`vmovshdup` / `vaddps` / `vpermilps imm8` / `vaddps`) like in *Fastest way to do horizontal float vector sum on x86* to get a horizontal sum in each 128-bit lane, and then `vpermps` to shuffle the desired elements to the bottom. Or `vcompressps` with a constant mask to do the same thing, optionally with a memory destination.
Once packed down to a single vector, you have a normal SIMD 128-bit add.
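Here's a minimal sketch of that whole sequence in C intrinsics, assuming one zmm of inputs whose four lane-sums get added into `x[0..3]`; the function name is made up for illustration:

```c
#include <immintrin.h>

/* Minimal sketch: horizontal sum within each 128-bit lane of v,
 * pack the four lane sums to the bottom, and add into x[0..3]. */
static void hsum_lanes_into_x(float *x, __m512 v)
{
    __m512 shuf = _mm512_movehdup_ps(v);                      // vmovshdup: odd elements down
    __m512 sums = _mm512_add_ps(v, shuf);                     // vaddps
    shuf = _mm512_permute_ps(sums, _MM_SHUFFLE(1, 0, 3, 2));  // vpermilps imm8: swap 64-bit halves
    sums = _mm512_add_ps(sums, shuf);                         // vaddps: lane sums in elements 0,4,8,12

    __m512i idx = _mm512_setr_epi32(0, 4, 8, 12, 0, 0, 0, 0,
                                    0, 0, 0, 0, 0, 0, 0, 0);
    __m512 packed = _mm512_permutexvar_ps(idx, sums);         // vpermps: desired elements to the bottom
    // Alternative pack: _mm512_maskz_compress_ps(0x1111, sums)  (vcompressps)

    __m128 lanesums = _mm512_castps512_ps128(packed);
    _mm_storeu_ps(x, _mm_add_ps(_mm_loadu_ps(x), lanesums));  // normal 128-bit SIMD add
}
```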
If your arrays are actually larger than 16, instead of `vpermps` you could use `vpermt2ps` to take every 4th element from each of two source vectors, setting you up to do the `+=` into `x[]` with 256-bit vectors. (Or combine again with another shuffle into 512-bit vectors, but that will probably bottleneck on shuffle throughput on SKX.)
On SKX, `vpermt2ps` is only a single uop, with 1c throughput / 3c latency, so it's very efficient for how powerful it is. On KNL it has 2c throughput, worse than `vpermps`, but maybe still worth it. (KNL doesn't have AVX512VL, but for adding to `x[]` with 256-bit vectors you (or a compiler) can use AVX1 `vaddps ymm` if you want.)
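A sketch of that variant, assuming two zmm vectors that already hold per-lane sums in elements 0, 4, 8, 12 (the helper name is hypothetical):

```c
#include <immintrin.h>

/* Sketch: vpermt2ps pulls every 4th element from two source vectors,
 * so the += into x[] can be one 256-bit add (plain AVX1 vaddps ymm). */
static void pack8_add_into_x(float *x, __m512 sums_a, __m512 sums_b)
{
    // index values 0..15 select from sums_a, 16..31 from sums_b
    __m512i idx = _mm512_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28,
                                    0, 0, 0, 0, 0, 0, 0, 0);
    __m512 packed = _mm512_permutex2var_ps(sums_a, idx, sums_b);  // vpermt2ps
    __m256 lo = _mm512_castps512_ps256(packed);
    _mm256_storeu_ps(x, _mm256_add_ps(_mm256_loadu_ps(x), lo));   // AVX1 vaddps ymm
}
```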
See https://agner.org/optimize/ for instruction tables.
For the load:
Is this done inside a loop, or repeatedly? (i.e. can you keep a shuffle-control vector in a register?) If so, you could:
- do a 128->512 broadcast with `VBROADCASTF32X4` (a single uop for a load port).
- do an in-lane shuffle with `vpermilps zmm,zmm,zmm` to broadcast a different element within each 128-bit lane, as in the sketch below. (This has to be separate from the broadcast-load, because a memory-source `vpermilps` can only have an `m512` or `m32bcst` source. Instructions typically have their memory-broadcast granularity equal to their element size, which is unfortunately not at all useful in cases like this. Also, `vpermilps` takes the control vector as its memory operand, not the source data.)
This is slightly better than `vpermps zmm,zmm,zmm` because the shuffle has 1-cycle latency instead of 3 (on Skylake-AVX512).
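As a sketch of that load-and-shuffle, assuming the goal is to repeat each of 4 contiguous floats across its own 128-bit lane (the function name is made up); a compiler should fold the load into a memory-source `VBROADCASTF32X4` and keep `ctrl` in a register across loop iterations:

```c
#include <immintrin.h>

/* Sketch: broadcast 4 floats to all four 128-bit lanes, then an in-lane
 * variable shuffle replicates a different element in each lane, giving
 * { p0,p0,p0,p0, p1,p1,p1,p1, p2,p2,p2,p2, p3,p3,p3,p3 }. */
static __m512 load_bcast_each_lane(const float *p)
{
    __m512 bcast = _mm512_broadcast_f32x4(_mm_loadu_ps(p));   // vbroadcastf32x4
    // per-lane control: lane 0 picks element 0, lane 1 element 1, ...
    const __m512i ctrl = _mm512_setr_epi32(0, 0, 0, 0, 1, 1, 1, 1,
                                           2, 2, 2, 2, 3, 3, 3, 3);
    return _mm512_permutevar_ps(bcast, ctrl);                 // vpermilps zmm,zmm,zmm: 1c latency
}
```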
Even outside a loop, loading a shuffle-control vector might still be your best bet.