You have 2 good options for a single-uop lane-crossing shuffle that you can use between a 512-bit load and store to shuffle the whole cache line. (vpsrldq would do 4 separate 128-bit right shifts, so that's unfortunately not what you want.)
vpermd would need a vector control operand, and zero-masking to "shift" in a zero. So the compiler would need extra instructions to load the control vector, and to kmov a constant into a mask register.
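For comparison, a vpermd version might look something like this (untested sketch; the function name and the exact index constant are mine). The compiler has to materialize both the index vector and the mask, which is exactly the extra work you'd rather avoid:

#include <immintrin.h>

// Untested sketch: zero-masked vpermd version, for comparison only
void shift64_right_4bytes_vpermd(int *arr) {
    __m512i v = _mm512_load_si512(arr);
    // indices 1..15 pick the next-higher dword; element 15 is zeroed by the
    // mask, so its index (15 repeated here) doesn't matter
    const __m512i idx = _mm512_setr_epi32(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,15);
    v = _mm512_maskz_permutexvar_epi32(0x7FFF, idx, v);   // vpermd with zero-masking
    _mm512_store_si512(arr, v);
}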
valignd is a 32-bit granularity fully lane-crossing version of SSSE3 / AVX2 vpalignr. But it doesn't have any of that horrible AVX / AVX2 "in-lane" behaviour where it does multiple separate 128-bit shuffles, so it's actually usable to shift a whole 256 or 512-bit vector left or right by a constant number of dwords. You need either zero-masking or a zeroed vector to shift in zeros from; a zeroed vector is as cheap as a NOP to create on Intel CPUs.
(perf numbers from https://www.uops.info/table.html - valignd is 1 uop for port 5 on Skylake-AVX512, same as vpermd or even vpermt2d, which could similarly grab a zero from another register.)
#include <immintrin.h>
alignas(64) int history[16]; // C++ has had portable syntax for alignment since C++11; 64-byte alignment so the aligned 512-bit load/store is safe
// assumes aligned pointer input
void shift64_right_4bytes(int *arr) {
__m512i v = _mm512_load_si512(arr); // AVX512 load intrinsics conveniently take void*, not __m512i*
v = _mm512_alignr_epi32( _mm512_setzero_si512(), v, 1 ); // v = (0:v) >> 32bits
_mm512_store_si512(arr, v);
}
Compiles to this asm (Godbolt):
# GCC10.2 -O3 -march=skylake-avx512
shift64_right_4bytes(int*):
vpxor xmm0, xmm0, xmm0
valignd zmm0, zmm0, ZMMWORD PTR [rdi], 1
vmovdqa64 ZMMWORD PTR [rdi], zmm0
vzeroupper
ret
Obviously the vpxor-zeroing and vzeroupper overhead could be hoisted/sunk out of loops after inlining, if you had an outer loop around that loop you showed.
So the real ALU work is just 1 uop for port 5. Of course, if you wrote this array with narrower stores very recently, you could get a store-forwarding stall. It could still be worth it: that's just extra latency for the load; it doesn't actually stall the whole pipeline or out-of-order execution of independent work.
If the rest of your code doesn't use 512-bit vectors, you might want to avoid them here (see SIMD instructions lowering CPU frequency).
2x 256-bit loads that overlap by one int might be good, then store them - i.e. a 15-int (60-byte) memmove with the same strategy that glibc's memcpy / memmove uses for small copies. Then store a zero at the end.
// only needs AVX1
// With 64-byte aligned history, no load or store crosses a cache-line boundary
void shift64_right_4bytes_256b(int *history) {
__m256i v0 = _mm256_loadu_si256((const __m256i*)(history+1));
__m256i v1 = _mm256_load_si256((const __m256i*)(history+8));
_mm256_store_si256((__m256i*)history, v0);
_mm256_storeu_si256((__m256i*)(history+7), v1); // overlap by 1 dword
history[15] = 0;
}
Or maybe use valignd ymm for the high half, to shift a zero into the vector instead of doing a separate scalar store. (That would require AVX512VL instead of just AVX1 for this version, but that's fine on AVX512 CPUs.)
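Something along those lines (untested sketch; the function name is mine, and it assumes AVX512VL on top of the 64-byte aligned history array):

// Untested sketch (needs AVX512VL): valignd ymm shifts the zero into the high
// half, replacing the scalar store of history[15]
void shift64_right_4bytes_256b_valignd(int *history) {
    __m256i v0 = _mm256_loadu_si256((const __m256i*)(history+1));  // ints 1..8
    __m256i v1 = _mm256_load_si256((const __m256i*)(history+8));   // ints 8..15
    v1 = _mm256_alignr_epi32(_mm256_setzero_si256(), v1, 1);       // ints 9..15, then 0
    _mm256_store_si256((__m256i*)history, v0);       // history[0..7]  = old 1..8
    _mm256_store_si256((__m256i*)(history+8), v1);   // history[8..15] = old 9..15, 0
}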
Partly depends on how you want to reload it, and whether the surrounding code does a lot of stores. (Back-end pressure on the store execution units and store buffer).
Or if it was originally stored with 2x 256-bit aligned stores, then the unaligned load could hit a store-forwarding stall, which you could avoid by using valignd to shift a dword between the high and low halves, as well as to shift a zero into the high half.
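That could look something like this (untested sketch, function name mine): every load and store is 32-byte aligned, and the two valignd ymm shuffles handle both the cross-half dword and the incoming zero.

// Untested sketch: all loads/stores 32-byte aligned, so they forward cleanly
// from earlier 2x 256-bit aligned stores.  Needs AVX512VL.
void shift64_right_4bytes_aligned_halves(int *history) {
    __m256i lo = _mm256_load_si256((const __m256i*)(history));     // ints 0..7
    __m256i hi = _mm256_load_si256((const __m256i*)(history+8));   // ints 8..15
    __m256i new_lo = _mm256_alignr_epi32(hi, lo, 1);                        // ints 1..8
    __m256i new_hi = _mm256_alignr_epi32(_mm256_setzero_si256(), hi, 1);    // ints 9..15, 0
    _mm256_store_si256((__m256i*)(history),   new_lo);
    _mm256_store_si256((__m256i*)(history+8), new_hi);
}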