
There are questions with similar titles, but my question relates to one very specific use case not covered elsewhere.

I have 4 __m128d registers (x0, x1, x2, x3) and I want to recombine their contents into 5 __m256d registers (y0, y1, y2, y3, y4) as follows, in preparation for other calculations:

on entry:
    x0 contains {a0, a1}
    x1 contains {a2, a3}
    x2 contains {a4, a5}
    x3 contains {a6, a7}
on exit:
    y0 contains {a0, a1, a2, a3}
    y1 contains {a1, a2, a3, a4}
    y2 contains {a2, a3, a4, a5}
    y3 contains {a3, a4, a5, a6}
    y4 contains {a4, a5, a6, a7}

My implementation below is quite slow. Is there a better way?

y0 = _mm256_set_m128d(x1, x0);              // {a0, a1, a2, a3}

__m128d lo = _mm_shuffle_pd(x0, x1, 1);     // {a1, a2}
__m128d hi = _mm_shuffle_pd(x1, x2, 1);     // {a3, a4}
y1 = _mm256_set_m128d(hi, lo);              // {a1, a2, a3, a4}

y2 = _mm256_set_m128d(x2, x1);              // {a2, a3, a4, a5}

lo = hi;                                    // {a3, a4}
hi = _mm_shuffle_pd(x2, x3, 1);             // {a5, a6}
y3 = _mm256_set_m128d(hi, lo);              // {a3, a4, a5, a6}

y4 = _mm256_set_m128d(x3, x2);              // {a4, a5, a6, a7}
  • Does the surrounding code bottleneck on throughput or latency for this? Is surrounding code already bottlenecked on shuffle throughput? (port 5 on Intel CPUs). If so, stores and then overlapping reloads are probably really good on Haswell / Skylake. Or maybe 2x `vinsertf128` and 2x 256-bit stores. – Peter Cordes Oct 25 '18 at 06:09
  • What CPU(s) do you care about optimizing for? Just Haswell / Skylake? or AMD as well? (and if so, how much do you care about Bulldozer-family as well as Ryzen?) Do you also care about older Intel AVX1 CPUs like Sandybridge specifically, where unaligned 256-bit loads are less efficient? – Peter Cordes Oct 25 '18 at 06:09
  • @Peter, target is Haswell or higher. I tried overlapping loads and it helps. I do not even need to save, as data is also already available in contiguous memory. I am surprised it helps. It is guaranteed that out of every 4 loads, 1 is aligned and 3 are unaligned. Still, the penalty seems negligible. – Fabio Oct 25 '18 at 07:09
  • 2
  • Why didn't you say that in the first place? You said your data was in registers, so I assumed it was the result of a previous calculation you'd just done. Haswell can do 2 loads per clock from L1d, or less when any cross a cache-line boundary. So if you can align your block by 64, it's perfectly efficient. (Like I said in my answer, make sure you compile with `-march=haswell` or `-mtune=haswell`, not just `-mavx`, to avoid gcc's `-mavx256-split-unaligned-load`.) – Peter Cordes Oct 25 '18 at 07:17
  • @Peter, they are in registers from previous calculations which required them to be loaded. I perhaps wrongly assumed that was a better starting point. – Fabio Oct 25 '18 at 07:26
  • I am going for the solution of overlapping loads. `-mtune=haswell` does not bring any noticeable improvement, at least on clang. – Fabio Oct 25 '18 at 07:55
  • Clang's default tuning doesn't split unaligned loads. You probably get the same code-gen for this either way, but `-mtune=haswell` is a good idea in general. Or especially `-march=haswell` to take full advantage of popcnt and BMI1/BMI2. – Peter Cordes Oct 25 '18 at 09:42
  • @Peter, `Haswell can do 2 loads per clock from L1d, or less when any cross a cache-line boundary`. Does the same apply also to `__m512`? What is the degradation when crossing cache line, but not page, 1 per clock or worse? Is the same true also for `storeu`? – Fabio Oct 26 '18 at 07:14
  • Extra latency, and throughput is reduced to something like 1 per clock. And yeah, same for `storeu`. There are also a limited number of split-load buffers, and a perf counter for it, so too many split loads could maybe bottleneck, IDK. Cache read/write ports are also needed by transfers to/from L2 cache, I think. On Skylake-AVX512, I'm not sure if there's an extra penalty worse than split 256-bit accesses for split 512-bit loads/stores, but any misalignment automatically means a cache-line split (because vector width = line size). – Peter Cordes Oct 26 '18 at 07:20
  • See [How can I accurately benchmark unaligned access speed on x86\_64](https://stackoverflow.com/q/45128763) for some measurements and info. It's not something I've tested in a lot of detail for SIMD, but when looping over big arrays, having them aligned helps significantly for AVX512 on SKX, vs. only a few % for 256-bit vectors on HSW/SKX. (With data coming from memory or L3) – Peter Cordes Oct 26 '18 at 07:22

1 Answer


With inputs in registers, you can do it in 5 shuffle instructions:

  • 3x vinsertf128 to create y0, y2, and y4 by concatenating 2 xmm registers each.
  • 2x vshufpd (in-lane shuffles) between those results to create y1 and y3.

Notice that the low lanes of y0 and y2 contain a1 and a2, the elements needed for the low lane of y1. And the same shuffle also works for the high lane.

#include <immintrin.h>

void merge(__m128d x0, __m128d x1, __m128d x2, __m128d x3,
     __m256d *__restrict y0, __m256d *__restrict y1,
     __m256d *__restrict y2, __m256d *__restrict y3, __m256d *__restrict y4)
{
    *y0 = _mm256_set_m128d(x1, x0);
    *y2 = _mm256_set_m128d(x2, x1);
    *y4 = _mm256_set_m128d(x3, x2);

    // take the high element from the first vector, low element from the 2nd.
    *y1 = _mm256_shuffle_pd(*y0, *y2, 0b0101);
    *y3 = _mm256_shuffle_pd(*y2, *y4, 0b0101);
}

Compiles pretty nicely (with gcc and clang -O3 -march=haswell on Godbolt) to:

merge(double __vector(2), double __vector(2), double __vector(2), double __vector(2), double __vector(4)*, double __vector(4)*, double __vector(4)*, double __vector(4)*, double __vector(4)*):
    vinsertf128     ymm0, ymm0, xmm1, 0x1
    vinsertf128     ymm3, ymm2, xmm3, 0x1
    vinsertf128     ymm1, ymm1, xmm2, 0x1
    # vmovapd YMMWORD PTR [rdi], ymm0
    vshufpd ymm0, ymm0, ymm1, 5
    # vmovapd YMMWORD PTR [rdx], ymm1
    vshufpd ymm1, ymm1, ymm3, 5
    # vmovapd YMMWORD PTR [r8], ymm3
    # vmovapd YMMWORD PTR [rsi], ymm0
    # vmovapd YMMWORD PTR [rcx], ymm1
    # vzeroupper
    # ret

I commented out the stores and stuff that would go away on inlining, so we really do just have the 5 shuffle instructions, vs. 9 shuffle instructions for the code in your question. (Also included in the Godbolt compiler explorer link).

This is very good on AMD, where vinsertf128 is super-cheap (because 256-bit registers are implemented as 2x 128-bit halves, so it's just a 128-bit copy without needing a special shuffle port). 256-bit lane-crossing shuffles are slow on AMD, but an in-lane 256-bit shuffle like vshufpd is just 2 uops.

On Intel it's pretty good, but mainstream Intel CPUs with AVX have only 1-per-clock shuffle throughput for 256-bit or FP shuffles. (Sandybridge and earlier have more throughput for integer 128-bit shuffles, but AVX2 CPUs dropped the extra shuffle units, and they didn't help anyway for this.)

So Intel CPUs can't exploit the instruction-level parallelism at all, but it's only 5 uops total which is nice. That's the minimum possible, because you need 5 results.


But especially if the surrounding code also bottlenecks on shuffles, it's worth considering a store/reload strategy with just 4 stores and 5 overlapping vector loads. Or maybe 2x vinsertf128 to construct y0 and y4, then 2x 256-bit stores + 3 overlapping reloads. That could let out-of-order exec get started on dependent instructions using just y0 or y4 while the store-forwarding stall resolves for y1..y3.

Especially if you don't care much about Intel first-gen Sandybridge where unaligned 256-bit vector loads are less efficient. (Note that you'd want to compile with gcc -mtune=haswell to turn off the -mavx256-split-unaligned-load default / sandybridge tuning, if you're using GCC. Regardless of the compiler, -march=native is a good idea if making binaries to run on the machine where you compile it, to take full advantage of instruction sets and set tuning options.)
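
For example, here is a minimal sketch of that store/reload idea, assuming the four inputs can be spilled to a small 64-byte-aligned scratch array (the buffer and function name are made up for illustration):

#include <immintrin.h>
#include <stdalign.h>   // alignas in C11 (a keyword in C++)

// Sketch only: spill the four 128-bit inputs contiguously, then take five
// overlapping 256-bit reloads.  The 64-byte alignment keeps every reload
// inside one cache line.  The wide reloads can't store-forward from the
// narrower stores, so they have extra latency, but cost no shuffle uops.
void merge_store_reload(__m128d x0, __m128d x1, __m128d x2, __m128d x3,
                        __m256d y[5])
{
    alignas(64) double buf[8];
    _mm_store_pd(buf + 0, x0);          // a0 a1
    _mm_store_pd(buf + 2, x1);          // a2 a3
    _mm_store_pd(buf + 4, x2);          // a4 a5
    _mm_store_pd(buf + 6, x3);          // a6 a7

    y[0] = _mm256_load_pd(buf + 0);     // 32-byte aligned
    y[1] = _mm256_loadu_pd(buf + 1);
    y[2] = _mm256_loadu_pd(buf + 2);
    y[3] = _mm256_loadu_pd(buf + 3);
    y[4] = _mm256_load_pd(buf + 4);     // 32-byte aligned
}

The 2x vinsertf128 + 2x 256-bit store variant would instead build y[0] and y[4] with _mm256_set_m128d as in the shuffle version and reload only y[1]..y[3], so work depending on y[0] or y[4] doesn't have to wait for the reloads.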

But if the bottleneck is more on total uop throughput from the front-end, then the shuffle implementation is best.

(See https://agner.org/optimize/ and other performance links in the x86 tag wiki for more about performance tuning. Also see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?, but Agner Fog's guide is more in-depth and explains what throughput vs. latency are actually about.)


I do not even need to save, as data is also already available in contiguous memory.

Then simply loading it with 5 overlapping loads is almost certainly the most efficient thing you could do.

Haswell can do 2 loads per clock from L1d, or less when any cross a cache-line boundary. So if you can align your block by 64, it's perfectly efficient with no cache-line-splits at all. Cache misses are slow, but reloading hot data from L1d cache is very cheap, and modern CPUs with AVX support generally have efficient unaligned-load support.

(Like I said earlier, if using gcc make sure you compile with -march=haswell or -mtune=haswell, not just -mavx, to avoid gcc's -mavx256-split-unaligned-load.)
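
Concretely, if the eight doubles are contiguous in memory (pointer name a assumed here just for illustration), the whole job is five loads:

#include <immintrin.h>

// Sketch, assuming the source is a contiguous double a[8], ideally aligned
// by 64 so that none of the overlapping loads splits a cache line.
void load_overlapping(const double *a, __m256d y[5])
{
    y[0] = _mm256_loadu_pd(a + 0);   // a0 a1 a2 a3
    y[1] = _mm256_loadu_pd(a + 1);   // a1 a2 a3 a4
    y[2] = _mm256_loadu_pd(a + 2);   // a2 a3 a4 a5
    y[3] = _mm256_loadu_pd(a + 3);   // a3 a4 a5 a6
    y[4] = _mm256_loadu_pd(a + 4);   // a4 a5 a6 a7
}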

4 loads + 1 vshufpd (y0, y2) might be a good way to balance load port pressure with ALU pressure, depending on bottlenecks in the surrounding code. Or even 3 loads + 2 shuffles, if the surrounding code is low on shuffle port pressure.
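
For instance, the 4-load + 1-shuffle balance could look like this (same assumed a[8] as in the sketch above):

#include <immintrin.h>

// Sketch: trade one unaligned load for one in-lane shuffle, building y[1]
// from y[0] and y[2] as in the shuffle-only version.
void load_4_shuffle_1(const double *a, __m256d y[5])
{
    y[0] = _mm256_loadu_pd(a + 0);                 // a0 a1 a2 a3
    y[2] = _mm256_loadu_pd(a + 2);                 // a2 a3 a4 a5
    y[3] = _mm256_loadu_pd(a + 3);                 // a3 a4 a5 a6
    y[4] = _mm256_loadu_pd(a + 4);                 // a4 a5 a6 a7
    y[1] = _mm256_shuffle_pd(y[0], y[2], 0b0101);  // a1 a2 a3 a4
}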


they are in registers from previous calculations which required them to be loaded.

If that previous calculation still has the source data in registers, you could have done 256-bit loads in the first place and just used their 128-bit low halves for the earlier calc. (An XMM register is the low 128 bits of the corresponding YMM register, and reading them doesn't disturb the upper lanes, so _mm256_castpd256_pd128 compiles to zero asm instructions.)

Do 256-bit loads for y0, y2, and y4, and use their low halves as x0, x1, and x2. (Construct y1 and y3 later with unaligned loads or shuffles.)

Only x3 isn't already the low 128 bits of a 256-bit vector you also want.

Ideally a compiler would already notice this optimization when you do a _mm_loadu_pd and a _mm256_loadu_pd from the same address, but you'll probably need to hand-hold it by doing

__m256d y0 = _mm256_loadu_pd(base);
__m128d x0 = _mm256_castpd256_pd128(y0);   // reuse the low half of y0; costs no instructions

and so on, and either an extract ALU intrinsic (_mm256_extractf128_pd) or a 128-bit load for x3, depending on the surrounding code. If it's only needed once, letting it fold into a memory operand for whatever instruction uses it might be best.
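
Putting that together, here is a sketch of the whole pattern (same base pointer as the fragment above; the plain 128-bit load for x3 is just one of the two options mentioned):

#include <immintrin.h>

// Sketch: one set of 256-bit loads feeds both widths.  The casts to __m128d
// cost no instructions; only x3 needs its own load or an extract.
void load_both_widths(const double *base, __m128d x[4], __m256d y[5])
{
    y[0] = _mm256_loadu_pd(base + 0);   // a0 a1 a2 a3
    y[2] = _mm256_loadu_pd(base + 2);   // a2 a3 a4 a5
    y[4] = _mm256_loadu_pd(base + 4);   // a4 a5 a6 a7
    y[1] = _mm256_loadu_pd(base + 1);   // or shuffle y[0], y[2] as shown earlier
    y[3] = _mm256_loadu_pd(base + 3);   // or shuffle y[2], y[4]

    x[0] = _mm256_castpd256_pd128(y[0]);   // a0 a1
    x[1] = _mm256_castpd256_pd128(y[2]);   // a2 a3
    x[2] = _mm256_castpd256_pd128(y[4]);   // a4 a5
    x[3] = _mm_loadu_pd(base + 6);         // a6 a7
    // alternatively: x[3] = _mm256_extractf128_pd(y[4], 1);
}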

Potential downside: slightly higher latency before the 128-bit calculation can start, or several cycles if the 256-bit loads were cache-line crossing where 128-bit loads weren't. But if your block of data is aligned by 64 bytes, this won't happen.

  • This is the correct answer to my question. However, given that the data is also available as a C array, sequential overlapping loads, as per @Peter's comment above, are the best solution to my problem. – Fabio Oct 25 '18 at 07:53
  • @Fabio: updated again: if you need the same data in both 128 and 256-bit vectors, you can use the low half of 256-bit loads as your 128-bit vectors. – Peter Cordes Oct 25 '18 at 20:44