tldr: It probably doesn't matter, just load the data twice.
I benchmarked loading the data twice vs. once. Loading twice is faster for small sizes, but as the number of elements transformed increases, loading once and rotating the neighbouring element in becomes only marginally faster.
NUM_FLOATS = 1 << 8
Run on (4 X 3299.05 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 0.30, 0.15, 0.05
-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_adjacent_load_twice       13.4 ns         13.4 ns     51912108
BM_adjacent_load_once        20.0 ns         20.0 ns     34998915
NUM_FLOATS = 1 << 16
-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_adjacent_load_twice      15353 ns        15353 ns        43726
BM_adjacent_load_once       14747 ns        14747 ns        47232
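To make the two strategies concrete, here is a minimal sketch of what "load twice" and "load once" can look like for an adjacent-sum kernel. This is not the benchmark code itself: the function names, the adjacent-sum operation, and the use of 128-bit SSE (chosen because it is baseline on x86-64) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE, baseline on x86-64

// Both functions compute out[i] = in[i] + in[i + 1] for i in [0, n).
// Assumptions: `in` holds n + 1 floats and n is a multiple of 4.

// Load twice: fetch the shifted window straight from memory with a
// second, overlapping unaligned load.
void adjacent_sum_load_twice(const float* in, float* out, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 cur  = _mm_loadu_ps(in + i);      // in[i .. i+3]
        __m128 next = _mm_loadu_ps(in + i + 1);  // in[i+1 .. i+4]
        _mm_storeu_ps(out + i, _mm_add_ps(cur, next));
    }
}

// Load once: each block is loaded a single time, and the shifted window
// is built in registers from the current and the following block.
void adjacent_sum_load_once(const float* in, float* out, std::size_t n) {
    assert(n % 4 == 0);
    if (n == 0) return;
    __m128 cur = _mm_loadu_ps(in);
    for (std::size_t i = 0; i < n; i += 4) {
        // The last block only has one float after it, so load just that one.
        __m128 nxt = (i + 4 < n) ? _mm_loadu_ps(in + i + 4)
                                 : _mm_load_ss(in + i + 4);
        __m128 t = _mm_move_ss(cur, nxt);  // [nxt0, cur1, cur2, cur3]
        // Rotate by one element: [cur1, cur2, cur3, nxt0]
        __m128 shifted = _mm_shuffle_ps(t, t, _MM_SHUFFLE(0, 3, 2, 1));
        _mm_storeu_ps(out + i, _mm_add_ps(cur, shifted));
        cur = nxt;
    }
}
```

The load-once variant trades one load per block for two shuffle-type instructions, which is consistent with the two approaches ending up so close in the numbers above.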
Re "SSSE3 has palignr which can do exactly that (the AVX2 variant of that instruction is almost useless)": not exactly. As Intel's article on AVX2 permutations puts it:
Therefore we need 2 instructions: “vperm2i128” and “vpalignr” to extend “palignr” on 256 bits.
https://web.archive.org/web/20170422034255/https://software.intel.com/en-us/blogs/2015/01/13/programming-using-avx2-permutations
You can find this implemented in Vc:
switch (amount) {
case 1:
    return _mm256_alignr_epi8(_mm256_permute2x128_si256(a, b, 0x21), a, sizeof(float));
case 2:
    return _mm256_alignr_epi8(_mm256_permute2x128_si256(a, b, 0x21), a, 2 * sizeof(float));
case 3:
    if (6u < Size) {
        return _mm256_alignr_epi8(_mm256_permute2x128_si256(a, b, 0x21), a, 3 * sizeof(float));
    }
    else assert(0);
}
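For anyone who wants to try the two-instruction trick outside Vc, the amount == 1 case above can be wrapped into a standalone function roughly like this. The name shift_in_one and the pointer-based interface are hypothetical, not part of Vc. The target attribute is a GCC/Clang extension that lets the function compile without -mavx2; the caller is still responsible for checking that the CPU supports AVX2.

```cpp
#include <cassert>
#include <immintrin.h>

// Hypothetical helper (not Vc's API). Reads 8 floats from `a` and `b` and
// writes out = [a1, a2, a3, a4, a5, a6, a7, b0].
__attribute__((target("avx2")))
void shift_in_one(const float* a, const float* b, float* out) {
    assert(a && b && out);
    const __m256i va = _mm256_castps_si256(_mm256_loadu_ps(a));
    const __m256i vb = _mm256_castps_si256(_mm256_loadu_ps(b));
    // vperm2i128 with 0x21: low lane = high lane of a, high lane = low lane of b,
    // i.e. bridge = [a4, a5, a6, a7 | b0, b1, b2, b3].
    const __m256i bridge = _mm256_permute2x128_si256(va, vb, 0x21);
    // vpalignr works per 128-bit lane: it concatenates (bridge lane : a lane)
    // and shifts right by 4 bytes, so each lane pulls in one float from the
    // lane that logically follows it.
    const __m256i r = _mm256_alignr_epi8(bridge, va, sizeof(float));
    _mm256_storeu_ps(out, _mm256_castsi256_ps(r));
}
```

The permute is what makes the 256-bit vpalignr usable here: on its own it cannot move data across the 128-bit lane boundary, so a4 would never reach the low lane.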
As for C++ header libraries:
Vc provides a shifted function with an overload that takes a shiftIn parameter, which seems to do the best thing for each architecture.
Vector Vector::shifted(int amount, Vector<T, Abi> shiftIn) const
xsimd provides shift_left / shift_right, which shift in zeros, so you could combine two shifted vectors with a bitwise OR (|). However, the performance might be questionable because, while on SSE they can do it in one instruction (e.g. _mm_slli_si128), on other architectures they require many.
EVE seems to be similar to xsimd.
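The "shift in zeros, then OR" idea mentioned for xsimd can be sketched with raw SSE2 intrinsics as below. This deliberately does not use the xsimd API; the function name and pointer-based interface are hypothetical, and it only shows the underlying byte-shift approach on 128-bit registers.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2, baseline on x86-64

// Hypothetical helper: reads 4 floats from `a` and `b` and writes
// out = [a1, a2, a3, b0].
void shift_in_one_sse2(const float* a, const float* b, float* out) {
    assert(a && b && out);
    const __m128i va = _mm_castps_si128(_mm_loadu_ps(a));
    const __m128i vb = _mm_castps_si128(_mm_loadu_ps(b));
    // Byte shift right by one float: drops a0, zeros enter at the top.
    const __m128i lo = _mm_srli_si128(va, 4);   // [a1, a2, a3, 0]
    // Byte shift left by three floats: moves b0 into the top slot.
    const __m128i hi = _mm_slli_si128(vb, 12);  // [0, 0, 0, b0]
    // The two results have zeros in complementary positions, so OR merges them.
    _mm_storeu_ps(out, _mm_castsi128_ps(_mm_or_si128(lo, hi)));
}
```

That is three instructions where SSSE3's palignr needs one, which is part of why a library-provided shift-in overload is preferable when it exists.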