
Are there built-in instructions to perform both right and left shift operations for 16-bit integer elements in AVX2?

Like the following examples:

[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] --> [16,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

and

[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] --> [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

where _mm_srli_si128(H, 14) and _mm_slli_si128(H, 2) work well on 16-bit elements in 128-bit SSE vectors. I ask because performance (running time) is crucial for me.

– MROF
  • Duplicates: [8 bit shift operation in AVX2 with shifting in zeros](http://stackoverflow.com/questions/20775005/8-bit-shift-operation-in-avx2-with-shifting-in-zeros) and [Emulating shifts on 32 bytes with AVX](http://stackoverflow.com/questions/25248766/emulating-shifts-on-32-bytes-with-avx) – Paul R Feb 23 '15 at 08:24

1 Answer


Unfortunately, there are no such instructions in AVX2. The AVX2 integer instructions are essentially the SSE2 ones extended to 256 bits, designed for compatibility with 128-bit SSE2: they operate on each 128-bit lane independently rather than on the register as a whole.
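
(For illustration, not from the original answer: the byte-shift intrinsics that AVX2 does provide, _mm256_slli_si256 / _mm256_srli_si256, only shift within each 128-bit lane, so bits never cross the middle of the register.)

#include <immintrin.h>

// Sketch: the AVX2 byte shift operates on each 128-bit lane independently.
// For 16-bit elements [1..16], shifting right by 2 bytes gives
// [2,3,4,5,6,7,8,0, 10,11,12,13,14,15,16,0] instead of the desired
// [2,3,...,16,0] -- the low element of the upper lane is lost.
static __m256i lane_shift_demo(__m256i v) {
    return _mm256_srli_si256(v, 2);
}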

If you know the number of 16-bit elements to shift at compile time, you can use a combination of permutes and shifts. E.g. you can logically break the value into 64-bit chunks, do permutations and shifts of these chunks, and then combine the results.

This is how I do it in my code:

#include <immintrin.h>

static __m256i m256_srl16_1(__m256i i) {
    // suppose i is [16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

    //[4, 3, 2, 1,      16, 15, 14, 13,   12, 11, 10, 9,   8, 7, 6, 5]
    __m256i srl64_q = _mm256_permute4x64_epi64(i, _MM_SHUFFLE(0,3,2,1));

    //[ 1, 0, 0, 0      13, 0, 0, 0       9, 0, 0, 0       5, 0, 0, 0]
    __m256i srl64_m = _mm256_slli_epi64(srl64_q, 3*16);
    //[ 0, 16, 15, 14,  0, 12, 11, 10,    0, 8, 7, 6,      0, 4, 3, 2]
    __m256i srl16_z = _mm256_srli_epi64(i, 1*16);

    // mask out the top qword, which received the wrapped-around low qword,
    // so that zero is shifted in at the top of the register
    __m256i srl64 = _mm256_and_si256(srl64_m, _mm256_set_epi64x(0, ~0, ~0, ~0));
    // combine the within-qword shifts with the elements carried across qwords
    __m256i r = _mm256_or_si256(srl64, srl16_z);

    return r;
}

If you need to shift by more than 64 bits, you need an extra permutation of the original value and a mask for the unneeded bits.
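
For example, a right shift by five 16-bit elements (80 bits) can be composed from a whole-qword shift plus the one-element shift above. This is only a sketch of that idea, not code from the answer (the name m256_srl16_5 is mine):

static __m256i m256_srl16_5(__m256i i) {
    // Rotate the 64-bit chunks down by one position...
    __m256i q = _mm256_permute4x64_epi64(i, _MM_SHUFFLE(0, 3, 2, 1));
    // ...and zero the top chunk, which received the wrapped-around low chunk.
    // Together this is a 64-bit (4-element) right shift of the whole register.
    __m256i srl64 = _mm256_and_si256(q, _mm256_set_epi64x(0, ~0, ~0, ~0));
    // One more 16-bit shift gives 4 + 1 = 5 elements in total.
    return m256_srl16_1(srl64);
}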

– sergfc
  • You can use [`_mm256_blend_epi32`](http://felixcloutier.com/x86/VPBLENDD.html) for a much more efficient blend at the end if the shift is a multiple of 4 bytes. Otherwise maybe a combination of that and `_mm256_blend_epi16` (where the same immediate control is used for both 128b lanes). – Peter Cordes Jul 13 '17 at 19:49 (sketched below)
  • Also useful: `_mm256_shuffle_epi8` to put each byte in place within its lane, or zero it. – Peter Cordes Jul 13 '17 at 19:53
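
To illustrate the first comment (my own sketch, not from the answer or the comments): for a shift by two 16-bit elements (4 bytes), a single _mm256_blend_epi32 can replace the final and/or pair, because the dword that wraps around in the permuted value is simply never selected:

static __m256i m256_srl16_2(__m256i i) {
    // Within each 64-bit chunk, shift right by 32 bits: [a,b,c,d] -> [c,d,0,0]
    __m256i lo = _mm256_srli_epi64(i, 32);
    // Rotate the chunks down by one and move each chunk's low dword to its top half
    __m256i q  = _mm256_permute4x64_epi64(i, _MM_SHUFFLE(0, 3, 2, 1));
    __m256i hi = _mm256_slli_epi64(q, 32);
    // Take dwords 1, 3, 5 from hi and the rest from lo (0x2A = 0b00101010);
    // the wrapped-around dword sits at position 7 of hi and is never selected,
    // so no separate masking is needed.
    return _mm256_blend_epi32(lo, hi, 0x2A);
}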