For the reduce-add, just do in-lane shuffles and adds (`vmovshdup` / `vaddps` / `vpermilps imm8` / `vaddps`) like in *Fastest way to do horizontal float vector sum on x86* to get a horizontal sum in each 128-bit lane, and then `vpermps` to shuffle the desired elements to the bottom. Or `vcompressps` with a constant mask to do the same thing, optionally with a memory destination.
Once packed down to a single vector, you have a normal SIMD 128-bit add.
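Here's a minimal sketch of that whole sequence in C intrinsics, assuming one zmm of inputs whose four lane-sums get added into `x[0..3]`; the function name is made up for illustration:

```c
#include <immintrin.h>

/* Minimal sketch: horizontal sum within each 128-bit lane of v,
 * pack the four lane sums to the bottom, and add into x[0..3]. */
static void hsum_lanes_into_x(float *x, __m512 v)
{
    __m512 shuf = _mm512_movehdup_ps(v);                      // vmovshdup: odd elements down
    __m512 sums = _mm512_add_ps(v, shuf);                     // vaddps
    shuf = _mm512_permute_ps(sums, _MM_SHUFFLE(1, 0, 3, 2));  // vpermilps imm8: swap 64-bit halves
    sums = _mm512_add_ps(sums, shuf);                         // vaddps: lane sums in elements 0,4,8,12

    __m512i idx = _mm512_setr_epi32(0, 4, 8, 12, 0, 0, 0, 0,
                                    0, 0, 0, 0, 0, 0, 0, 0);
    __m512 packed = _mm512_permutexvar_ps(idx, sums);         // vpermps: desired elements to the bottom
    // Alternative pack: _mm512_maskz_compress_ps(0x1111, sums)  (vcompressps)

    __m128 lanesums = _mm512_castps512_ps128(packed);
    _mm_storeu_ps(x, _mm_add_ps(_mm_loadu_ps(x), lanesums));  // normal 128-bit SIMD add
}
```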
If your arrays are actually larger than 16, instead of `vpermps` you could use `vpermt2ps` to take every 4th element from each of two source vectors, setting you up to do the `+=` into `x[]` with 256-bit vectors. (Or combine again with another shuffle into 512-bit vectors, but that will probably bottleneck on shuffle throughput on SKX.)
On SKX, `vpermt2ps` is only a single uop, with 1c throughput / 3c latency, so it's very efficient for how powerful it is. On KNL it has 2c throughput, worse than `vpermps`, but maybe still worth it. (KNL doesn't have AVX512VL, but for adding to `x[]` with 256-bit vectors you (or a compiler) can use AVX1 `vaddps ymm` if you want.)
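A sketch of that variant, assuming two zmm vectors that already hold per-lane sums in elements 0, 4, 8, 12 (the helper name is hypothetical):

```c
#include <immintrin.h>

/* Sketch: vpermt2ps pulls every 4th element from two source vectors,
 * so the += into x[] can be one 256-bit add (plain AVX1 vaddps ymm). */
static void pack8_add_into_x(float *x, __m512 sums_a, __m512 sums_b)
{
    // index values 0..15 select from sums_a, 16..31 from sums_b
    __m512i idx = _mm512_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28,
                                    0, 0, 0, 0, 0, 0, 0, 0);
    __m512 packed = _mm512_permutex2var_ps(sums_a, idx, sums_b);  // vpermt2ps
    __m256 lo = _mm512_castps512_ps256(packed);
    _mm256_storeu_ps(x, _mm256_add_ps(_mm256_loadu_ps(x), lo));   // AVX1 vaddps ymm
}
```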
See https://agner.org/optimize/ for instruction tables.
For the load:
Is this done inside a loop, or repeatedly? (i.e. can you keep a shuffle-control vector in a register?) If so, you could:
- do a 128->512 broadcast with `VBROADCASTF32X4` (a single uop for a load port).
- do an in-lane shuffle with `vpermilps zmm,zmm,zmm` to broadcast a different element within each 128-bit lane, as in the sketch below. (This has to be separate from the broadcast-load, because a memory-source `vpermilps` can only have an `m512` or `m32bcst` source. Instructions typically have their memory-broadcast granularity equal to their element size, which is unfortunately not at all useful in cases like this. Also, `vpermilps` takes the control vector as its memory operand, not the source data.)
This is slightly better than `vpermps zmm,zmm,zmm` because the shuffle has 1-cycle latency instead of 3 (on Skylake-AVX512).
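As a sketch of that load-and-shuffle, assuming the goal is to repeat each of 4 contiguous floats across its own 128-bit lane (the function name is made up); a compiler should fold the load into a memory-source `VBROADCASTF32X4` and keep `ctrl` in a register across loop iterations:

```c
#include <immintrin.h>

/* Sketch: broadcast 4 floats to all four 128-bit lanes, then an in-lane
 * variable shuffle replicates a different element in each lane, giving
 * { p0,p0,p0,p0, p1,p1,p1,p1, p2,p2,p2,p2, p3,p3,p3,p3 }. */
static __m512 load_bcast_each_lane(const float *p)
{
    __m512 bcast = _mm512_broadcast_f32x4(_mm_loadu_ps(p));   // vbroadcastf32x4
    // per-lane control: lane 0 picks element 0, lane 1 element 1, ...
    const __m512i ctrl = _mm512_setr_epi32(0, 0, 0, 0, 1, 1, 1, 1,
                                           2, 2, 2, 2, 3, 3, 3, 3);
    return _mm512_permutevar_ps(bcast, ctrl);                 // vpermilps zmm,zmm,zmm: 1c latency
}
```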
Even outside a loop, loading a shuffle-control vector might still be your best bet.