12

I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics.

Much to my disappointment, I discovered that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate only on the two 128-bit halves of the AVX register separately, so zeroes are introduced in the middle instead of bytes crossing the lane boundary. (This is in contrast to _mm_slli_si128 and _mm_srli_si128, which shift the whole SSE register.)

Can you recommend a short substitute?

UPDATE:

_mm256_slli_si256 is efficiently achieved with

_mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N)

or, for shifts larger than 16 bytes, with

_mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N)

But the question remains for _mm256_srli_si256.

  • 2
    How about reminding us what those slli instructions do, or even better what you want to do exactly? Did you look at the code generated by gcc with __builtin_shuffle or clang with its own syntax? – Marc Glisse Aug 11 '14 at 18:03
  • 2
    And what do you mean by "only the upper half" "the rest is zeroed"? That's not what Intel's doc says. – Marc Glisse Aug 11 '14 at 18:18
  • 3
    The reason why there is no 32-byte shift is that the hardware simply can't do it. The hardware is SIMD, and a full-vector shift is not SIMD. If you find that you're needing such instructions, it might be worth reconsidering the design. You're probably trying to do something non-SIMD using SIMD which often leads to an avalanche of other (performance) problems as well. If it's an issue of misalignment, just use misaligned memory access. On Haswell, misaligned access is almost as fast as aligned access. – Mysticial Aug 11 '14 at 18:56
  • @Marc Glisse: "The empty low-order bytes are cleared (set to all '0')." https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-9D1254FD-66F7-4AD0-B4C4-7749D6935063.htm –  Aug 11 '14 at 22:23
  • @Mysticial: as written in my post, the SSE _mm_slli_si128 performs a full shift. And so did psrlq/psllq in "old" MMX. I assume implementing a full 256-bit barrel shifter was too much to ask. I am working on neighborhood image processing functions, which are inherently mixed-aligned. –  Aug 11 '14 at 22:27
  • 2
    @YvesDaoust I believe you are misinterpreting that doc. In *each 128-bit half*, the data is shifted to the left and 0s are used to fill in the empty space on the right. "Low order" is to be understood as **inside the 128-bit lane**. It does not zero a whole lane. By the way, Intel's html doc of the compiler intrinsics sucks, it is often unreadable or wrong, the PDF instruction set reference is much more helpful. – Marc Glisse Aug 11 '14 at 22:43
  • @Marc Glisse: that's right, I am updating the question. The problem remains, anyway, as some of the bytes are dropped. –  Aug 12 '14 at 07:01
  • @Paul R: my question is not a duplicate as it holds for both left and right shifts. The previous one only solves the case of a left shift very efficiently with a `_mm256_alignr_epi8` instruction. Unfortunately, there is no `_mm256_alignl_epi8` correspondence. –  Aug 12 '14 at 10:46
  • You don't need `_mm256_alignl_epi8` (which is why there is no instruction or intrinsic for this) - `_mm256_alignr_epi8` works for both left and right shift cases (just switch the arguments and adjust the shift value). – Paul R Aug 12 '14 at 11:37
  • If you reopen the question I can provide a complete solution. –  Aug 12 '14 at 11:58
  • @YvesDaoust: OK - voting to re-open, but ideally this question needs to be merged with its [earlier doppelgänger](http://stackoverflow.com/questions/20775005/8-bit-shift-operation-in-avx2-with-shifting-in-zeros). – Paul R Aug 12 '14 at 12:39
  • 1
    When migrating 128-bit SIMD to AVX-256, it's generally easier to think about the problem in terms of two glued together 128-bit operations, instead of a whole 256-bit operation. Not always ideal, but makes translating them a snap and usually performs better than shoehorning it in with permutes. – Cory Nelson Aug 28 '14 at 02:52

3 Answers

9

From different inputs, I gathered these solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8.

_mm256_slli_si256(A, N):

    0 < N < 16:   _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 - N)
    N = 16:       _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0))
    16 < N < 32:  _mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), N - 16)

_mm256_srli_si256(A, N):

    0 < N < 16:   _mm256_alignr_epi8(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), A, N)
    N = 16:       _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1))
    16 < N < 32:  _mm256_srli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), N - 16)
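Since N must be a compile-time constant anyway (the underlying instructions take immediate operands), the cases above can be wrapped into two small helpers. A minimal, untested sketch assuming a C++17 compiler (if constexpr); the helper names are invented here for illustration and are not a library API:

#include <immintrin.h>

// Full 256-bit byte shifts built from the table above. N must be known at
// compile time because the underlying instructions take immediate operands.

template <int N>
__m256i mm256_slli_si256_full (__m256i a)
{
    static_assert(N >= 0 && N < 32, "byte shift count out of range");
    __m256i lo_in_hi = _mm256_permute2x128_si256(a, a, _MM_SHUFFLE(0, 0, 2, 0)); // [ A.low : 0 ]
    if constexpr (N == 0)       return a;
    else if constexpr (N < 16)  return _mm256_alignr_epi8(a, lo_in_hi, 16 - N);
    else if constexpr (N == 16) return lo_in_hi;
    else                        return _mm256_slli_si256(lo_in_hi, N - 16);
}

template <int N>
__m256i mm256_srli_si256_full (__m256i a)
{
    static_assert(N >= 0 && N < 32, "byte shift count out of range");
    __m256i hi_in_lo = _mm256_permute2x128_si256(a, a, _MM_SHUFFLE(2, 0, 0, 1)); // [ 0 : A.high ]
    if constexpr (N == 0)       return a;
    else if constexpr (N < 16)  return _mm256_alignr_epi8(hi_in_lo, a, N);
    else if constexpr (N == 16) return hi_in_lo;
    else                        return _mm256_srli_si256(hi_in_lo, N - 16);
}

Usage would then look like __m256i r = mm256_srli_si256_full<20>(v); with older compilers the same dispatch can be done with macros.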
  • 1
    The key to crossing the inter-lane barrier is `_mm256_permute2x128_si256`, surely ? – Paul R Aug 12 '14 at 14:38
  • No, I mean performing an operation that assembles bytes from two different lanes. As the doc states, the processor creates a "32-byte composite" before shifting. The permute handles whole lanes. –  Aug 12 '14 at 16:27
  • On Ryzen and KNL, `_mm256_permute2x128_si256` is slower than [`_mm256_permute4x64_epi64`](http://felixcloutier.com/x86/VPERMQ.html) for permuting lanes of a single vector like you're doing here. – Peter Cordes Jul 13 '17 at 19:57
  • @PeterCordes: significantly ? –  Jul 13 '17 at 20:36
  • Yes, on Ryzen `vperm2i128` is 8 uops, lat=3 tput=3. `vpermq` is 3 uops, lat=2, tput=2. (Those are actually for the FP equivalents, `vperm2f128` and `vpermpd`, since Agner Fog omitted a lot of AVX2 integer stuff for Ryzen). On KNL, `vpermq` has twice the throughput and 1c lower latency. There's no downside on any CPU, AFAIK; `vpermq` is always at least as good as `vperm2i128` for shuffling within a single vector. Plus, it can fold a load as a memory source operand. – Peter Cordes Jul 13 '17 at 20:42
  • Update, on Zen2 / Zen3, `vperm2i128` is faster (1 uop) than `vpermq` (2 uops). So it's a tradeoff between Zen1 vs. Zen2/3. :/ https://uops.info/ – Peter Cordes Feb 08 '22 at 22:54
5

Here is a function that bit-shifts a ymm register left using AVX2. I use it to shift left by one bit, though it looks like it works for shift counts of up to 63 bits.

//----------------------------------------------------------------------------
// bit shift left a 256-bit value using ymm registers
//          __m256i *data - data to shift
//          int count     - number of bits to shift
// return:  __m256i       - carry out bit(s)

static __m256i bitShiftLeft256ymm (__m256i *data, int count)
   {
   __m256i innerCarry, carryOut, rotate;

   innerCarry = _mm256_srli_epi64 (*data, 64 - count);                        // carry outs in bit 0 of each qword
   rotate     = _mm256_permute4x64_epi64 (innerCarry, 0x93);                  // rotate ymm left 64 bits
   innerCarry = _mm256_blend_epi32 (_mm256_setzero_si256 (), rotate, 0xFC);   // clear lower qword
   *data      = _mm256_slli_epi64 (*data, count);                             // shift all qwords left
   *data      = _mm256_or_si256 (*data, innerCarry);                          // propagate carries from low qwords
   carryOut   = _mm256_xor_si256 (innerCarry, rotate);                        // clear all except lower qword
   return carryOut;
   }

//----------------------------------------------------------------------------
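A possible usage sketch (mine, untested, assuming bitShiftLeft256ymm above is in the same translation unit): shift a value with bit 255 and bit 0 set left by one bit and print the carried-out bit.

#include <immintrin.h>
#include <stdio.h>

int main (void)
   {
   // bit 255 and bit 0 set; _mm256_set_epi64x takes the highest qword first
   __m256i value = _mm256_set_epi64x ((long long) 0x8000000000000000ULL, 0, 0, 1);
   __m256i carry = bitShiftLeft256ymm (&value, 1);

   unsigned long long out[4];
   _mm256_storeu_si256 ((__m256i *) out, carry);
   printf ("carry out = %llu\n", out[0]);    // expected: 1 (bit 255 was shifted out)
   return 0;
   }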
1

If the shift count is a multiple of 4 bytes, vpermd (_mm256_permutevar8x32_epi32) with the right shuffle mask will do the trick with one instruction (or more, if you actually need to zero the shifted-in bytes instead of copying a different element over them).

To support variable (multiple-of-4-byte) shift counts, you could load the control mask from a sliding window over an array like 0 0 0 0 0 0 0 1 2 3 4 5 6 7 0 0 0 0 0 0 0 or something similar, except that index 0 just selects the bottom element and doesn't zero anything out. For more on this idea of generating a mask from a sliding window, see my answer on another question.

This answer is pretty minimal, since vpermd doesn't directly solve the problem. I point it out as an alternative that might work in some cases where you're looking for a full vector shift.
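To make the fixed-count case concrete, here is a hedged sketch (untested, helper name invented for illustration) of a right shift by 8 bytes, i.e. two dwords, using vpermd plus a blend against zero to clear the dwords shifted in at the top:

#include <immintrin.h>

// Fixed right shift of a whole __m256i by 8 bytes (2 dwords): vpermd pulls
// each dword from two positions higher, then a blend against zero clears the
// two dwords shifted in at the top.
static inline __m256i srli_si256_by8_vpermd (__m256i a)
{
    const __m256i idx = _mm256_setr_epi32(2, 3, 4, 5, 6, 7, 7, 7);   // last two indices are don't-care
    __m256i shifted = _mm256_permutevar8x32_epi32(a, idx);           // vpermd crosses lanes freely
    return _mm256_blend_epi32(shifted, _mm256_setzero_si256(), 0xC0); // zero dwords 6 and 7
}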
