You have 2 good options for a single-uop lane-crossing shuffle that you can use between a 512-bit load and store to shuffle the whole cache line. (vpsrldq would do 4 separate 128-bit right shifts, so that's unfortunately not what you want.)
vpermd would need a vector control operand, and zero-masking to "shift" in a zero. So the compiler would need extra instructions to load the control vector, and to kmov a constant into a mask register.
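For comparison, a vpermd version might look something like this (untested sketch; the function name and the exact index constant are mine). The compiler has to materialize both the index vector and the mask, which is exactly the extra work you'd rather avoid:

#include <immintrin.h>

// Untested sketch: zero-masked vpermd version, for comparison only
void shift64_right_4bytes_vpermd(int *arr) {
    __m512i v = _mm512_load_si512(arr);
    // indices 1..15 pick the next-higher dword; element 15 is zeroed by the
    // mask, so its index (15 repeated here) doesn't matter
    const __m512i idx = _mm512_setr_epi32(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,15);
    v = _mm512_maskz_permutexvar_epi32(0x7FFF, idx, v);   // vpermd with zero-masking
    _mm512_store_si512(arr, v);
}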
valignd is a 32-bit granularity fully lane-crossing version of SSSE3 / AVX2 vpalignr. But it doesn't have any of that horrible AVX / AVX2 "in-lane" behaviour where it does multiple separate 128-bit shuffles, so it's actually usable to shift a whole 256 or 512-bit vector left or right by a constant number of dwords. You need either zero-masking or a zeroed vector to shift in zeros from; a zeroed vector is as cheap as a NOP to create on Intel CPUs.
(perf numbers from https://www.uops.info/table.html - valignd is 1 uop for port 5 on Skylake-AVX512, same as vpermd or even vpermt2d, which could similarly grab a zero from another register.)
#include <immintrin.h>
alignas(64) int history[16]; // C++ has had portable syntax for alignment since C++11; 64-byte alignment so the aligned 512-bit load/store is safe
// assumes aligned pointer input
void shift64_right_4bytes(int *arr) {
__m512i v = _mm512_load_si512(arr); // AVX512 load intrinsics conveniently take void*, not __m512i*
v = _mm512_alignr_epi32( _mm512_setzero_si512(), v, 1 ); // v = (0:v) >> 32bits
_mm512_store_si512(arr, v);
}
Compiles to this asm (Godbolt):
# GCC10.2 -O3 -march=skylake-avx512
shift64_right_4bytes(int*):
vpxor xmm0, xmm0, xmm0
valignd zmm0, zmm0, ZMMWORD PTR [rdi], 1
vmovdqa64 ZMMWORD PTR [rdi], zmm0
vzeroupper
ret
Obviously the vpxor-zeroing and vzeroupper overhead could be hoisted/sunk out of loops after inlining, if you had an outer loop around that loop you showed.
So the real ALU work is just 1 uop for port 5. Of course, if you wrote this array with narrower stores very recently, you could get a store-forwarding stall. It could still be worth it: that's just extra latency for the load; it doesn't actually stall the whole pipeline or out-of-order execution of independent work.
If the rest of your code doesn't use 512-bit vectors, you might want to avoid them here (see SIMD instructions lowering CPU frequency).
2x 256-bit loads that overlap by one int might be good, then store them - i.e. a 15-int (60-byte) memmove with the same strategy that glibc's memcpy / memmove uses for small copies. Then store a zero at the end.
// only needs AVX1
// With 64-byte aligned history, no load or store crosses a cache-line boundary
void shift64_right_4bytes_256b(int *history) {
__m256i v0 = _mm256_loadu_si256((const __m256i*)(history+1));
__m256i v1 = _mm256_load_si256((const __m256i*)(history+8));
_mm256_store_si256((__m256i*)history, v0);
_mm256_storeu_si256((__m256i*)(history+7), v1); // overlap by 1 dword
history[15] = 0;
}
Or maybe use valignd ymm for the high half, to shift a zero into the vector instead of doing a separate scalar store. (That would require AVX512VL instead of just AVX1 for this version, but that's fine on AVX512 CPUs.)
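Something along those lines (untested sketch; the function name is mine, and it assumes AVX512VL on top of the 64-byte aligned history array):

// Untested sketch (needs AVX512VL): valignd ymm shifts the zero into the high
// half, replacing the scalar store of history[15]
void shift64_right_4bytes_256b_valignd(int *history) {
    __m256i v0 = _mm256_loadu_si256((const __m256i*)(history+1));  // ints 1..8
    __m256i v1 = _mm256_load_si256((const __m256i*)(history+8));   // ints 8..15
    v1 = _mm256_alignr_epi32(_mm256_setzero_si256(), v1, 1);       // ints 9..15, then 0
    _mm256_store_si256((__m256i*)history, v0);       // history[0..7]  = old 1..8
    _mm256_store_si256((__m256i*)(history+8), v1);   // history[8..15] = old 9..15, 0
}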
Partly depends on how you want to reload it, and whether the surrounding code does a lot of stores. (Back-end pressure on the store execution units and store buffer).
Or if it was originally stored with 2x 256-bit aligned stores, then the unaligned load could hit a store-forwarding stall, which you could avoid by using valignd to shift a dword between the high and low halves, as well as to shift a zero into the high half.
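That could look something like this (untested sketch, function name mine): every load and store is 32-byte aligned, and the two valignd ymm shuffles handle both the cross-half dword and the incoming zero.

// Untested sketch: all loads/stores 32-byte aligned, so they forward cleanly
// from earlier 2x 256-bit aligned stores.  Needs AVX512VL.
void shift64_right_4bytes_aligned_halves(int *history) {
    __m256i lo = _mm256_load_si256((const __m256i*)(history));     // ints 0..7
    __m256i hi = _mm256_load_si256((const __m256i*)(history+8));   // ints 8..15
    __m256i new_lo = _mm256_alignr_epi32(hi, lo, 1);                        // ints 1..8
    __m256i new_hi = _mm256_alignr_epi32(_mm256_setzero_si256(), hi, 1);    // ints 9..15, 0
    _mm256_store_si256((__m256i*)(history),   new_lo);
    _mm256_store_si256((__m256i*)(history+8), new_hi);
}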