
Doing a zip transform with a C++ SIMD header library, we might have the following pseudocode.

// using xsimd
auto binary_op = [](const auto& a, const auto& b){ return ...; };
float *a, *b, *res;
...
for (auto i = 0; i < ...; i += batch_size)
{
    auto batch_a = xs::load_aligned(a + i);
    auto batch_b = xs::load_aligned(b + i);
    auto batch_res = binary_op(batch_a, batch_b);
    batch_res.store_aligned(res + i);
}

I was wondering whether an adjacent transform

a0
a1 a0 -> a1+a0
a2 a1 -> a2+a1
a3 a2 -> a3+a2
   a3

could be sped up, since we might be able to call load_aligned only once per iteration.

auto batch_a1 = xs::load_aligned(...);
...
for (...)
{
    auto batch_a0 = batch_a1;
    batch_a1 = xs::load_aligned(...);
    
    auto batch_b0 = ...; // somehow create from batch_a0 and batch_a1

    auto batch_res = binary_op(batch_a0, batch_b0);
    batch_res.store_aligned(...);

}

Perhaps someone could suggest how to perform the following kind of operation with SIMD intrinsics:

([a0, a1, a2, a3], [a4, a5, a6, a7]) -> [a1, a2, a3, a4]

And would this even be likely to cause a speedup?

Tom Huntington
  • SSSE3 has `palignr` which can do exactly that (the AVX2 variant of that instruction is almost useless, though). I have no idea whether your C++ SIMD library supports that instruction. Also, whether this causes a speedup depends a lot on your surrounding code (and your target CPU). – chtz Sep 05 '22 at 00:31
  • useless -> https://stackoverflow.com/questions/8517970/mm-alignr-epi8-palignr-equivalent-in-avx2 – Tom Huntington Sep 05 '22 at 00:47
  • On modern x86 CPUs, unaligned loads are pretty efficient, except when split across a 4k page. A cache-line split across a 64-byte boundary has a bit of extra latency, and costs 2 accesses to L1d cache, but otherwise no extra throughput penalty. As @chtz says, SSSE3 solves this with `palignr`; AVX1/2 don't solve it efficiently. AVX-512F solves it for 32-bit elements with [`valignd`](https://www.felixcloutier.com/x86/valignd:valignq) which still needs an immediate count, but shifts across the whole vector, including YMM or ZMM, not just within 16-byte chunks like useless AVX2 `vpalignr`. – Peter Cordes Sep 05 '22 at 01:01

2 Answers


TL;DR: It probably doesn't matter; just load the data twice.


I benchmarked loading the data twice vs. loading it once, and it seems that loading the data twice is faster for smaller sizes, but as the number of elements transformed increases, doing a rotate-in becomes negligibly faster.

NUM_FLOATS = 1 << 8

Run on (4 X 3299.05 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 0.30, 0.15, 0.05
-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_adjacent_load_twice       13.4 ns         13.4 ns     51912108
BM_adjacent_load_once        20.0 ns         20.0 ns     34998915

NUM_FLOATS = 1 << 16

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_adjacent_load_twice      15353 ns        15353 ns        43726
BM_adjacent_load_once       14747 ns        14747 ns        47232
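
For concreteness, the "load twice" variant is essentially the question's loop with one extra unaligned load per iteration. A minimal sketch in plain AVX intrinsics (not my exact benchmark code; the function name is made up, and it assumes `a` is 32-byte aligned and the number of adjacent sums `n` is a multiple of 8):

#include <immintrin.h>
#include <cstddef>

void adjacent_add_load_twice(const float* a, float* res, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 8)
    {
        __m256 lo = _mm256_load_ps(a + i);       // [a_i .. a_i+7]
        __m256 hi = _mm256_loadu_ps(a + i + 1);  // [a_i+1 .. a_i+8], misaligned
        _mm256_storeu_ps(res + i, _mm256_add_ps(lo, hi));
    }
}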

Re "SSSE3 has palignr which can do exactly that (the AVX2 variant of that instruction is almost useless": not exactly

Therefore we need 2 instructions: “vperm2i128” and “vpalignr” to extend “palignr” on 256 bits.

https://web.archive.org/web/20170422034255/https://software.intel.com/en-us/blogs/2015/01/13/programming-using-avx2-permutations

You can find this implemented here in Vc:

switch (amount) {
case 1:
    return _mm256_alignr_epi8(_mm256_permute2x128_si256(a, b, 0x21), a, sizeof(float));
case 2:
    return _mm256_alignr_epi8(_mm256_permute2x128_si256(a, b, 0x21), a, 2 * sizeof(float));
case 3:
    if (6u < Size) {
        return _mm256_alignr_epi8(_mm256_permute2x128_si256(a, b, 0x21), a, 3 * sizeof(float));
    }
    else assert(0);
}
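
Putting those two intrinsics to work on the float case from the question, the "load once" variant looks roughly like this (an AVX2 sketch, not my exact benchmark code; the helper and function names are made up, and the loop assumes `a` is 32-byte aligned, `n` is a multiple of 8, and the input is padded so the final full-batch load at `a + i + 8` stays in valid memory):

#include <immintrin.h>
#include <cstddef>

// lo = [a0..a7], hi = [a8..a15]  ->  [a1..a8]
static inline __m256 rotate_in_1(__m256 lo, __m256 hi)
{
    __m256i lo_i = _mm256_castps_si256(lo);
    __m256i hi_i = _mm256_castps_si256(hi);
    // high lane of lo next to low lane of hi: [a4..a7, a8..a11]
    __m256i mid = _mm256_permute2x128_si256(lo_i, hi_i, 0x21);
    // per-lane byte shift stitches the halves into [a1..a8]
    return _mm256_castsi256_ps(_mm256_alignr_epi8(mid, lo_i, sizeof(float)));
}

void adjacent_add_load_once(const float* a, float* res, std::size_t n)
{
    __m256 cur = _mm256_load_ps(a);               // [a0..a7]
    for (std::size_t i = 0; i < n; i += 8)
    {
        __m256 next = _mm256_load_ps(a + i + 8);  // the only load per iteration
        _mm256_storeu_ps(res + i, _mm256_add_ps(cur, rotate_in_1(cur, next)));
        cur = next;
    }
}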

As for C++ header libraries:

Vc provides a `shifted` function with an overload that takes a shift-in vector as a parameter, and it seems to do the best thing for each architecture.

Vector Vector::shifted(int amount, Vector<T, Abi> shiftIn) const

xsimd provides `shift_left` / `shift_right`, which shift in zeros, so you could combine them with bitwise or (|). However, the performance might be questionable: while SSE can do such a shift in one instruction (i.e. `_mm_slli_si128`), other architectures may require many.

EVE seems to be similar to xsimd.
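
To make the "shift in zeros, then OR" idea concrete at the intrinsics level, here is a plain SSE2 sketch for the 4-float case from the question (helper name made up; on SSSE3 a single `_mm_alignr_epi8` replaces all three instructions, as the comments below point out):

#include <emmintrin.h>

// lo = [a0,a1,a2,a3], hi = [a4,a5,a6,a7]  ->  [a1,a2,a3,a4]
static inline __m128 rotate_in_1_sse2(__m128 lo, __m128 hi)
{
    __m128i low  = _mm_srli_si128(_mm_castps_si128(lo), 4);   // [a1, a2, a3, 0]
    __m128i high = _mm_slli_si128(_mm_castps_si128(hi), 12);  // [ 0,  0,  0, a4]
    return _mm_castsi128_ps(_mm_or_si128(low, high));
}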

Tom Huntington
  • The one instruction you're looking for in SSE is SSSE3 `palignr` (intrinsic `_mm_alignr_epi8`); doing it with regular vector shifts would require 2 shifts and an OR. But if you only have SSE2, then yeah you'd want `_mm_slli_si128`. – Peter Cordes Sep 05 '22 at 05:40
  • Most architectures can, I think, either shuffle or blend bytes, and even more likely in 32-bit chunks for whole floats. You don't necessarily need to set up for an OR to emulate `palignr`, if you can blend. – Peter Cordes Sep 05 '22 at 05:43
  • @PeterCordes yes, I suppose shuffle (= permute?) then blend might be more performant for some header libraries. Anyway, I decided I'll just hard-code AVX2 for now, since I can't be bothered setting up compiling simultaneously for all architectures and then runtime-dispatching to the correct implementation function. – Tom Huntington Sep 05 '22 at 06:26
  • AVX1/2 `__m256` is the worst case for this; I'd be curious if you got a speedup or not, compared to unaligned loads, with data already hot in L1d cache; otherwise both ways probably just bottleneck on L2 bandwidth. (If you test, please mention what microarchitecture, e.g. Zen2 or Skylake.) Intel would bottleneck on one shuffle per clock (yes, aka permute); neither `vpalignr` nor `vperm2f128` can run on the extra shuffle unit in Ice Lake. Zen1 is especially slow at lane-crossing shuffles like vperm2f128 (although IDK how well it handles misaligned 32-byte loads either). – Peter Cordes Sep 05 '22 at 11:50
  • @PeterCordes Updated with the benchmark. – Tom Huntington Sep 07 '22 at 01:58
  • Cool, thanks for testing. Once you bottleneck on L3 bandwidth instead of L1d or at least L2, strategy becomes less important, although it's interesting that the leader switched. I guess waiting on cache misses makes memory-level parallelism important, and possibly load-buffer entries consumed to see farther across page boundaries. – Peter Cordes Sep 08 '22 at 02:23
  • @PeterCordes I think where I went wrong was thinking that data loads were significantly slower than arithmetic/logic/bit-operation instructions. "L1 CACHE hit, ~4 cycles, L2 CACHE hit, ~10 cycles", but instructions are similar, between 3 and 5 clock cycles. – Tom Huntington Sep 08 '22 at 03:02
  • Latency isn't usually a big factor except for L2 misses. It's the 2/clock throughput of loads vs. 1/clock throughput of the relevant shuffles that matters. (Or overall execution-port throughput when shuffles have to compete with the real ALU work.) In this case, with a workload that's just storing again after one `vaddps`, load and store-address uops may be competing with each other (on CPUs before Ice Lake), and in general L1d cache bandwidth limits from doing more loads. – Peter Cordes Sep 10 '22 at 00:04

On modern x86 CPUs, unaligned loads are pretty efficient, except when split across a 4k page. A cache-line split across a 64-byte boundary has a bit of extra latency, and costs 2 accesses to L1d cache, but otherwise no extra throughput penalty. (How can I accurately benchmark unaligned access speed on x86_64? describes actual performance.)

As @chtz says, SSSE3 solves this with palignr; AVX1/2 don't solve it efficiently. On Core 2, SSSE3 was definitely worth using instead of unaligned loads, but that microarchitecture is 1.5 decades old. See Why is SSE aligned read + shuffle slower than unaligned read on some CPUs but not on others? and Cacheline splits, take two, from Dark Shikari's blog (x264 lead developer).
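
For the exact shuffle in the question, ([a0, a1, a2, a3], [a4, a5, a6, a7]) -> [a1, a2, a3, a4], it would look something like this with intrinsics (a sketch; the helper name is made up, and note the high half goes in the first operand):

#include <tmmintrin.h>

static inline __m128 shift_in_one(__m128 lo, __m128 hi)
{
    // concatenate hi:lo and shift right by one float (4 bytes)
    return _mm_castsi128_ps(
        _mm_alignr_epi8(_mm_castps_si128(hi), _mm_castps_si128(lo), 4));
}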


AVX-512F has valignd which does this shuffle with 32-bit granularity. It still needs an immediate count, but shifts across the whole vector, including YMM or ZMM, not just within 16-byte chunks like the nearly-useless AVX2 vpalignr (_mm_alignr_epi8 (PALIGNR) equivalent in AVX2)

Unlike AVX2, it might well be worth using for AVX-512, where every misaligned load would be a cache-line split. Could go either way with SSE, depending on the microarchitecture and what else is in the loop.
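
For reference, the same helper with valignd for 16 floats (again a sketch, with a made-up name):

#include <immintrin.h>

// lo = [a0..a15], hi = [a16..a31]  ->  [a1..a16]
static inline __m512 shift_in_one_avx512(__m512 lo, __m512 hi)
{
    return _mm512_castsi512_ps(
        _mm512_alignr_epi32(_mm512_castps_si512(hi), _mm512_castps_si512(lo), 1));
}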


As for libraries portably exposing a shuffle like this, IDK. I think ARM NEON might have a 2-vector byte shuffle you could use like palignr.
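
(That would be NEON's ext instruction; something like this, untested:)

#include <arm_neon.h>

// lo = [a0,a1,a2,a3], hi = [a4,a5,a6,a7]  ->  [a1,a2,a3,a4]
static inline float32x4_t shift_in_one_neon(float32x4_t lo, float32x4_t hi)
{
    return vextq_f32(lo, hi, 1);
}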

Peter Cordes