
I was surprised to see that _mm256_sllv_epi16/8(__m256i v1, __m256i v2) and _mm256_srlv_epi16/8(__m256i v1, __m256i v2) were not in the Intel Intrinsics Guide, and I couldn't find any way to recreate those AVX512 intrinsics with only AVX2.

This function left-shifts each packed 16/8-bit integer in v1 by the count value of the corresponding data element in v2.

Example for epi16:

__m256i v1 = _mm256_set1_epi16(0b1111111111111111);
__m256i v2 = _mm256_setr_epi16(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
v1 = _mm256_sllv_epi16(v1, v2);

Then v1 equals (1111111111111111, 1111111111111110, 1111111111111100, 1111111111111000, ................, 1000000000000000);
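The per-element behaviour I want is equivalent to this plain scalar loop (just a sketch with a hypothetical helper name, not an emulation; a count of 16 or more zeroes the element):

#include <stdint.h>

static void sllv_epi16_ref(uint16_t dst[16], const uint16_t a[16], const uint16_t count[16]) {
    /* Scalar reference: shift each 16-bit element left by its own count; counts >= 16 give 0. */
    for (int i = 0; i < 16; i++)
        dst[i] = (count[i] < 16) ? (uint16_t)(a[i] << count[i]) : 0;
}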

yatsukino
  • @1201ProgramAlarm: true, but the OP wants to emulate them with AVX2, so their code can run on Haswell / Ryzen, instead of only AVX512BW (SKX). And no CPU has `_mm256_sllv_epi8` / `vpsllvb` because it doesn't exist, not even in AVX512VBMI2. I removed the avx512 tag because this is not an avx512 question. – Peter Cordes Aug 11 '18 at 00:37

2 Answers


In the _mm256_sllv_epi8 case, it isn't too difficult to replace the shifts by multiplications, using the pshufb instruction as a tiny lookup table. It is also possible to emulate the right shift of _mm256_srlv_epi8 with multiplications and quite a few other instructions; see the code below. I would expect that at least the _mm256_sllv_epi8 emulation is more efficient than Nyan's solution.


More or less the same idea can be used to emulate _mm256_sllv_epi16, but in that case it is less trivial to select the right multiplier (see also code below).

The solution _mm256_sllv_epi16_emu below is not necessarily any faster, nor better, than Nyan's solution. The performance depends on the surrounding code and on the CPU that is used. Nevertheless, the solution here might be of interest, at least on older computer systems. For example, the vpsllvd instruction is used twice in Nyan's solution. This instruction is fast on Intel Skylake systems or newer. On Intel Broadwell or Haswell this instruction is slow, because it decodes to 3 micro-ops. The solution here avoids this slow instruction.

It is possible to skip the two lines of code with mask_lt_15, if the shift counts are known to be less than or equal to 15.

Missing intrinsic _mm256_srlv_epi16 is left as an exercise to the reader.
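One possible solution to that exercise, in the same multiply-based spirit, is sketched below. It is only a sketch, not part of the tested code: it assumes all shift counts are in the range 0 to 15, and relies on the fact that a logical right shift by n equals a multiplication by 2^(16-n) followed by taking the high 16 bits of the product (count 0 is handled with a blend).

__m256i _mm256_srlv_epi16_emu(__m256i a, __m256i count) {
    __m256i multiplier_lut = _mm256_set_epi8(0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1, 0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1);
    __m256i byte_shuf_mask = _mm256_set_epi8(14,14,12,12, 10,10,8,8, 6,6,4,4, 2,2,0,0, 14,14,12,12, 10,10,8,8, 6,6,4,4, 2,2,0,0);

    __m256i mask_zero      = _mm256_cmpeq_epi16(count, _mm256_setzero_si256()); /* Lanes with count == 0.                                                 */
    __m256i n              = _mm256_sub_epi16(_mm256_set1_epi16(16), count);    /* 16 - count, which is in the range 1..16 for count in 0..15.            */
            n              = _mm256_shuffle_epi8(n, byte_shuf_mask);            /* Duplicate the low byte of each 16 bit element.                         */
            n              = _mm256_sub_epi8(n, _mm256_set1_epi16(0x0800));     /* Subtract 8 at the odd (high) byte positions.                           */
    __m256i multiplier     = _mm256_shuffle_epi8(multiplier_lut, n);            /* 2^(16-count) as a 16 bit constant.                                     */
    __m256i x              = _mm256_mulhi_epu16(a, multiplier);                 /* High 16 bits of a * 2^(16-count), i.e. a >> count, for count in 1..15. */
            return _mm256_blendv_epi8(x, a, mask_zero);                         /* For count == 0 the result is a itself.                                 */
}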


/*     gcc -O3 -m64 -Wall -mavx2 -march=broadwell shift_v_epi8.c     */
#include <immintrin.h>
#include <stdio.h>
int print_epi8(__m256i  a);
int print_epi16(__m256i  a);

__m256i _mm256_sllv_epi8(__m256i a, __m256i count) {
    __m256i mask_hi        = _mm256_set1_epi32(0xFF00FF00);
    __m256i multiplier_lut = _mm256_set_epi8(0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1, 0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1);

    __m256i count_sat      = _mm256_min_epu8(count, _mm256_set1_epi8(8));     /* AVX shift counts are not masked. So a_i << n_i = 0 for n_i >= 8. count_sat is always less than 9.*/ 
    __m256i multiplier     = _mm256_shuffle_epi8(multiplier_lut, count_sat);  /* Select the right multiplication factor in the lookup table.                                      */
    __m256i x_lo           = _mm256_mullo_epi16(a, multiplier);               /* Unfortunately _mm256_mullo_epi8 doesn't exist. Split the 16 bit elements in a high and low part. */

    __m256i multiplier_hi  = _mm256_srli_epi16(multiplier, 8);                /* The multiplier of the high bits.                                                                 */
    __m256i a_hi           = _mm256_and_si256(a, mask_hi);                    /* Mask off the low bits.                                                                           */
    __m256i x_hi           = _mm256_mullo_epi16(a_hi, multiplier_hi);
    __m256i x              = _mm256_blendv_epi8(x_lo, x_hi, mask_hi);         /* Merge the high and low part.                                                                     */
            return x;
}


__m256i _mm256_srlv_epi8(__m256i a, __m256i count) {
    __m256i mask_hi        = _mm256_set1_epi32(0xFF00FF00);
    __m256i multiplier_lut = _mm256_set_epi8(0,0,0,0, 0,0,0,0, 1,2,4,8, 16,32,64,128, 0,0,0,0, 0,0,0,0, 1,2,4,8, 16,32,64,128);

    __m256i count_sat      = _mm256_min_epu8(count, _mm256_set1_epi8(8));     /* AVX shift counts are not masked. So a_i >> n_i = 0 for n_i >= 8. count_sat is always less than 9.*/ 
    __m256i multiplier     = _mm256_shuffle_epi8(multiplier_lut, count_sat);  /* Select the right multiplication factor in the lookup table.                                      */
    __m256i a_lo           = _mm256_andnot_si256(mask_hi, a);                 /* Mask off the high bits.                                                                          */
    __m256i multiplier_lo  = _mm256_andnot_si256(mask_hi, multiplier);        /* The multiplier of the low bits.                                                                  */
    __m256i x_lo           = _mm256_mullo_epi16(a_lo, multiplier_lo);         /* Shift left a_lo by multiplying.                                                                  */
            x_lo           = _mm256_srli_epi16(x_lo, 7);                      /* Shift right by 7 to get the low bits at the right position.                                      */

    __m256i multiplier_hi  = _mm256_and_si256(mask_hi, multiplier);           /* The multiplier of the high bits.                                                                 */
    __m256i x_hi           = _mm256_mulhi_epu16(a, multiplier_hi);            /* Variable shift left a_hi by multiplying. Use a instead of a_hi because the a_lo bits don't interfere */
            x_hi           = _mm256_slli_epi16(x_hi, 1);                      /* Shift left by 1 to get the high bits at the right position.                                      */
    __m256i x              = _mm256_blendv_epi8(x_lo, x_hi, mask_hi);         /* Merge the high and low part.                                                                     */
            return x;
}


__m256i _mm256_sllv_epi16_emu(__m256i a, __m256i count) {
    __m256i multiplier_lut = _mm256_set_epi8(0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1, 0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1);
    __m256i byte_shuf_mask = _mm256_set_epi8(14,14,12,12, 10,10,8,8, 6,6,4,4, 2,2,0,0, 14,14,12,12, 10,10,8,8, 6,6,4,4, 2,2,0,0);

    __m256i mask_lt_15     = _mm256_cmpgt_epi16(_mm256_set1_epi16(16), count);
            a              = _mm256_and_si256(mask_lt_15, a);                    /* Set a to zero if count > 15.                                                                      */
            count          = _mm256_shuffle_epi8(count, byte_shuf_mask);         /* Duplicate bytes from the even positions to bytes at the even and odd positions.                   */
            count          = _mm256_sub_epi8(count,_mm256_set1_epi16(0x0800));   /* Subtract 8 at the odd (high) byte positions. Note that the vpshufb instruction selects a zero byte if the shuffle control mask is negative.  */
    __m256i multiplier     = _mm256_shuffle_epi8(multiplier_lut, count);         /* Select the right multiplication factor in the lookup table. Within the 16 bit elements, only the upper byte or the lower byte is nonzero. */
    __m256i x              = _mm256_mullo_epi16(a, multiplier);                  
            return x;
}


int main(){

    printf("Emulating _mm256_sllv_epi8:\n");
    __m256i a     = _mm256_set_epi8(32,31,30,29, 28,27,26,25, 24,23,22,21, 20,19,18,17, 16,15,14,13, 12,11,10,9, 8,7,6,5, 4,3,2,1);
    __m256i count = _mm256_set_epi8(7,6,5,4, 3,2,1,0,  11,10,9,8, 7,6,5,4, 3,2,1,0,  11,10,9,8, 7,6,5,4, 3,2,1,0);
    __m256i x     = _mm256_sllv_epi8(a, count);
    printf("a     = \n"); print_epi8(a    );
    printf("count = \n"); print_epi8(count);
    printf("x     = \n"); print_epi8(x    );
    printf("\n\n"); 


    printf("Emulating _mm256_srlv_epi8:\n");
            a     = _mm256_set_epi8(223,224,225,226, 227,228,229,230, 231,232,233,234, 235,236,237,238, 239,240,241,242, 243,244,245,246, 247,248,249,250, 251,252,253,254);
            count = _mm256_set_epi8(7,6,5,4, 3,2,1,0,  11,10,9,8, 7,6,5,4, 3,2,1,0,  11,10,9,8, 7,6,5,4, 3,2,1,0);
            x     = _mm256_srlv_epi8(a, count);
    printf("a     = \n"); print_epi8(a    );
    printf("count = \n"); print_epi8(count);
    printf("x     = \n"); print_epi8(x    );
    printf("\n\n"); 



    printf("Emulating _mm256_sllv_epi16:\n");
            a     = _mm256_set_epi16(1601,1501,1401,1301, 1200,1100,1000,900, 800,700,600,500, 400,300,200,100);
            count = _mm256_set_epi16(17,16,15,13,  11,10,9,8, 7,6,5,4, 3,2,1,0);
            x     = _mm256_sllv_epi16_emu(a, count);
    printf("a     = \n"); print_epi16(a    );
    printf("count = \n"); print_epi16(count);
    printf("x     = \n"); print_epi16(x    );
    printf("\n\n"); 

    return 0;
}


int print_epi8(__m256i  a){
  char v[32];
  int i;
  _mm256_storeu_si256((__m256i *)v,a);
  for (i = 0; i<32; i++) printf("%4hhu",v[i]);
  printf("\n");
  return 0;
}

int print_epi16(__m256i  a){
  unsigned short int  v[16];
  int i;
  _mm256_storeu_si256((__m256i *)v,a);
  for (i = 0; i<16; i++) printf("%6hu",v[i]);
  printf("\n");
  return 0;
}

The output is:

Emulating _mm256_sllv_epi8:
a     = 
   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
count = 
   0   1   2   3   4   5   6   7   8   9  10  11   0   1   2   3   4   5   6   7   8   9  10  11   0   1   2   3   4   5   6   7
x     = 
   1   4  12  32  80 192 192   0   0   0   0   0  13  28  60 128  16  64 192   0   0   0   0   0  25  52 108 224 208 192 192   0


Emulating _mm256_srlv_epi8:
a     = 
 254 253 252 251 250 249 248 247 246 245 244 243 242 241 240 239 238 237 236 235 234 233 232 231 230 229 228 227 226 225 224 223
count = 
   0   1   2   3   4   5   6   7   8   9  10  11   0   1   2   3   4   5   6   7   8   9  10  11   0   1   2   3   4   5   6   7
x     = 
 254 126  63  31  15   7   3   1   0   0   0   0 242 120  60  29  14   7   3   1   0   0   0   0 230 114  57  28  14   7   3   1


Emulating _mm256_sllv_epi16:
a     = 
   100   200   300   400   500   600   700   800   900  1000  1100  1200  1301  1401  1501  1601
count = 
     0     1     2     3     4     5     6     7     8     9    10    11    13    15    16    17
x     = 
   100   400  1200  3200  8000 19200 44800 36864 33792 53248 12288 32768 40960 32768     0     0

Indeed, some AVX2 instructions are missing. However, note that it is not always a good idea to fill these gaps by emulating the 'missing' AVX2 instructions. Sometimes it is more efficient to redesign your code in such a way that these emulated instructions are avoided, for example by working with wider vector elements (_epi32 instead of _epi16) that have native support.

wim
  • Can we get any use out of [`vpmaddubsw`](http://felixcloutier.com/x86/PMADDUBSW.html) to do some masking for us? No, we'd need to mask the `vpshufb` result both ways to create `0` in every other element. And for the high half of each pair, we'd need a multiplier of `256 * n`, but that wouldn't fit in a byte. – Peter Cordes Aug 14 '18 at 13:45
  • @PeterCordes I think the `vpmaddubsw` is a good idea, thanks! Probably it is possible to improve the computation of `x_lo` in `_mm256_srlv_epi8` with that instruction (get rid of one `andnot`). – wim Aug 14 '18 at 14:20
  • Oh yes, I wasn't thinking about right shift, but the expanding odd and even elements to the bottom of a 16-bit pair is not bad at all. With an immediate `vpsllw` to put the high byte back at the top, we do come out at least one instruction ahead, right? But beware that Haswell runs shift and multiply only on `p0`, so they compete. Or maybe we can use `vpslldq` if shift/mul pressure is worse than shuffle pressure. – Peter Cordes Aug 14 '18 at 14:33
  • @PeterCordes Yes, but at least one of the other byte elements should be masked off anyway, when using `vpmaddubsw`, so it's not always a win. I'll come back to it later... – wim Aug 14 '18 at 15:00
  • Maybe `vpmaddubsw` for the low half, and `vpmulhuw` for the high half? Yeah, I think that's a drop-in replacement that saves one `andnot` – Peter Cordes Aug 14 '18 at 15:04
  • @PeterCordes Somehow `vpmaddubsw` for the low half didn't work. I don't know exactly why, but the results were wrong in some cases. Note that the multiplication of `vpmaddubsw` is a bit weird: unsigned 8-bit * signed 8-bit = saturated signed 16-bit. With the arguments interchanged, some results became right, but others went wrong. – wim Aug 14 '18 at 18:15
  • Oh right, we can't use it because it can't multiply by `+128`. I was thinking we could use the signed input as the shift count (with the unsigned input as the data to be shifted). The signed saturation is a problem, it only matters for adding. `255*127` is only 0x7e81, while `-128 * 255 = -32640` which also fits in signed 16-bit. – Peter Cordes Aug 14 '18 at 20:49
  • If we use the data as the signed input, it would be like *sign* extending to 16-bit before shifting, so that doesn't work if we want the high byte as part of a right shift. – Peter Cordes Aug 14 '18 at 20:53
  • @PeterCordes Thanks for the explanation! – wim Aug 14 '18 at 21:02

It's strange that they missed that, though it seems many AVX integer instructions are only available for 32/64-bit widths. At least 16-bit got added in AVX512BW (though I still don't get why Intel refuses to add 8-bit shifts).

We can emulate 16-bit variable shifts using only AVX2 by using 32-bit variable shifts with some masking and blending.

For each 16-bit element we need its shift count at the bottom of the containing 32-bit element, which we can get with an AND (for the low element) and an immediate right shift (for the high half). (Unlike scalar shifts, x86 vector shifts saturate their count instead of wrapping/masking it.)

We also need to mask off the low 16 bits of the data before doing the high-half shift, so we aren't shifting garbage into the high 16-bit half of the containing 32-bit element.

__m256i _mm256_sllv_epi16(__m256i a, __m256i count) {
    const __m256i mask = _mm256_set1_epi32(0xffff0000); // alternating low/high words of a dword
    // shift low word of each dword: low_half = (a << (count & 0xffff)) [for each 32b element]
    // note that, because `a` isn't being masked here, we may get some "junk" bits, but these will get eliminated by the blend below
    __m256i low_half = _mm256_sllv_epi32(
        a,
        _mm256_andnot_si256(mask, count)
    );
    // shift high word of each dword: high_half = ((a & 0xffff0000) << (count >> 16)) [for each 32b element]
    __m256i high_half = _mm256_sllv_epi32(
        _mm256_and_si256(mask, a),     // make sure we shift in zeros
        _mm256_srli_epi32(count, 16)   // need the high-16 count at the bottom of a 32-bit element
    );
    // combine low and high words
    return _mm256_blend_epi16(low_half, high_half, 0xaa);
}

__m256i _mm256_srlv_epi16(__m256i a, __m256i count) {
    const __m256i mask = _mm256_set1_epi32(0x0000ffff);
    __m256i low_half = _mm256_srlv_epi32(
        _mm256_and_si256(mask, a),
        _mm256_and_si256(mask, count)
    );
    __m256i high_half = _mm256_srlv_epi32(
        a,
        _mm256_srli_epi32(count, 16)
    );
    return _mm256_blend_epi16(low_half, high_half, 0xaa);
}
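
As a quick sanity check, the example from the question gives the expected per-lane results with the emulated left shift (a minimal usage sketch):

__m256i v1 = _mm256_set1_epi16((short)0xFFFF);   /* same value as 0b1111111111111111 */
__m256i v2 = _mm256_setr_epi16(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
v1 = _mm256_sllv_epi16(v1, v2);                  /* lane i is now 0xFFFF << i */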

GCC 8.2 compiles this to more-or-less what you'd expect:

_mm256_srlv_epi16(long long __vector(4), long long __vector(4)):
        vmovdqa       ymm3, YMMWORD PTR .LC0[rip]
        vpand   ymm2, ymm0, ymm3
        vpand   ymm3, ymm1, ymm3
        vpsrld  ymm1, ymm1, 16
        vpsrlvd ymm2, ymm2, ymm3
        vpsrlvd ymm0, ymm0, ymm1
        vpblendw        ymm0, ymm2, ymm0, 170
        ret
_mm256_sllv_epi16(long long __vector(4), long long __vector(4)):
        vmovdqa       ymm3, YMMWORD PTR .LC1[rip]
        vpandn  ymm2, ymm3, ymm1
        vpsrld  ymm1, ymm1, 16
        vpsllvd ymm2, ymm0, ymm2
        vpand   ymm0, ymm0, ymm3
        vpsllvd ymm0, ymm0, ymm1
        vpblendw        ymm0, ymm2, ymm0, 170
        ret

Meaning that the emulation results in 1x load + 2x AND/ANDN + 2x variable-shift + 1x right-shift + 1x blend.

Clang 6.0 does something interesting - it eliminates the memory load (and corresponding masking) by using blends:

_mm256_sllv_epi16(long long __vector(4), long long __vector(4)):
        vpxor   xmm2, xmm2, xmm2
        vpblendw        ymm3, ymm1, ymm2, 170
        vpsllvd ymm3, ymm0, ymm3
        vpsrld  ymm1, ymm1, 16
        vpblendw        ymm0, ymm2, ymm0, 170
        vpsllvd ymm0, ymm0, ymm1
        vpblendw        ymm0, ymm3, ymm0, 170
        ret
_mm256_srlv_epi16(long long __vector(4), long long __vector(4)):
        vpxor   xmm2, xmm2, xmm2
        vpblendw        ymm3, ymm0, ymm2, 170
        vpblendw        ymm2, ymm1, ymm2, 170
        vpsrlvd ymm2, ymm3, ymm2
        vpsrld  ymm1, ymm1, 16
        vpsrlvd ymm0, ymm0, ymm1
        vpblendw        ymm0, ymm2, ymm0, 170
        ret

This results in: 1x clear + 3x blend + 2x variable-shift + 1x right-shift.

I haven't benchmarked which approach is faster, but I suspect it depends on the CPU, in particular on the cost of PBLENDW.

Of course, if your use case is a little more constrained, the above can be simplified; e.g. if your shift amounts are all constants, you can remove the masking/shifting used to position the counts (assuming the compiler doesn't do this for you automatically).
For a left shift with constant shift amounts, you could use _mm256_mullo_epi16 instead, converting the shift amounts to powers of two that can be multiplied, e.g. for the example you gave:

__m256i v1 = _mm256_set1_epi16(0b1111111111111111);
__m256i v2 = _mm256_setr_epi16(1<<0,1<<1,1<<2,1<<3,1<<4,1<<5,1<<6,1<<7,1<<8,1<<9,1<<10,1<<11,1<<12,1<<13,1<<14,1<<15);
v1 = _mm256_mullo_epi16(v1, v2);

Update: Peter mentions (see comment below) that a right shift can also be implemented with _mm256_mulhi_epi16 (e.g. to perform v>>1, multiply v by 1<<15 and take the high word).
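
A minimal sketch of that idea for the all-ones example above, using the unsigned variant _mm256_mulhi_epu16 so the shift is a logical one. A count of 0 would need a multiplier of 1<<16, which doesn't fit in 16 bits, so this shifts lane i right by i+1:

__m256i v = _mm256_set1_epi16((short)0xFFFF);
__m256i m = _mm256_setr_epi16((short)(1u<<15), 1<<14, 1<<13, 1<<12, 1<<11, 1<<10, 1<<9, 1<<8,
                              1<<7, 1<<6, 1<<5, 1<<4, 1<<3, 1<<2, 1<<1, 1<<0);
v = _mm256_mulhi_epu16(v, m);   /* lane i == 0xFFFF >> (i+1), because v>>n == (v * (1<<(16-n))) >> 16 */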


For 8-bit variable shifts, those don't exist even in AVX512 (again, I don't know why Intel doesn't provide 8-bit SIMD shifts).
If AVX512BW is available, you could use a similar trick to the above, using _mm256_sllv_epi16. For AVX2, I can't think of a particularly better approach than applying the emulation for 16-bit a second time, as you ultimately have to do 4x the shifting of what the 32-bit shift gives you. See @wim's answer for a nice solution for 8-bit in AVX2.

This is what I came up with (basically the 16-bit version adapted for 8-bit on AVX512):

__m256i _mm256_sllv_epi8(__m256i a, __m256i count) {
    const __m256i mask = _mm256_set1_epi16(0xff00);
    __m256i low_half = _mm256_sllv_epi16(
        a,
        _mm256_andnot_si256(mask, count)
    );
    __m256i high_half = _mm256_sllv_epi16(
        _mm256_and_si256(mask, a),
        _mm256_srli_epi16(count, 8)
    );
    return _mm256_blendv_epi8(low_half, high_half, _mm256_set1_epi16(0xff00));
}

__m256i _mm256_srlv_epi8(__m256i a, __m256i count) {
    const __m256i mask = _mm256_set1_epi16(0x00ff);
    __m256i low_half = _mm256_srlv_epi16(
        _mm256_and_si256(mask, a),
        _mm256_and_si256(mask, count)
    );
    __m256i high_half = _mm256_srlv_epi16(
        a,
        _mm256_srli_epi16(count, 8)
    );
    return _mm256_blendv_epi8(low_half, high_half, _mm256_set1_epi16(0xff00));
}

(Peter Cordes mentions below that _mm256_blendv_epi8(low_half, high_half, _mm256_set1_epi16(0xff00)) can be replaced with _mm256_mask_blend_epi8(0xaaaaaaaa, low_half, high_half) in a pure AVX512BW(+VL) implementation, which is likely faster)
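
For reference, a sketch of what that pure AVX512BW+VL version could look like (hypothetical helper name; here _mm256_sllv_epi16 is the native instruction and the mask-register blend compiles to vpblendmb):

__m256i sllv_epi8_avx512bw(__m256i a, __m256i count) {
    const __m256i mask = _mm256_set1_epi16((short)0xff00);
    __m256i low_half  = _mm256_sllv_epi16(a, _mm256_andnot_si256(mask, count));
    __m256i high_half = _mm256_sllv_epi16(_mm256_and_si256(mask, a),
                                          _mm256_srli_epi16(count, 8));
    /* take the high byte of each 16-bit pair from high_half (mask bits 1,3,5,...) */
    return _mm256_mask_blend_epi8(0xaaaaaaaa, low_half, high_half);
}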

Nyan
  • `vmovdqa64`: you compiled with AVX512 enabled. That's ok, because it doesn't look like you used any intrinsics that require AVX512, though. If you're going to show asm output, it's nice to include a permalink to the code on https://godbolt.org/ so people can go and play around with it themselves. (Use a [full-link to prevent any link-rot](https://meta.stackoverflow.com/questions/319549/how-are-we-supposed-to-post-godbolt-links-now-that-url-shortening-is-blocked/319594#319594), not a short-link). e.g. [How to convert 32-bit float to 8-bit signed char?](https://stackoverflow.com/a/51779212). – Peter Cordes Aug 12 '18 at 05:17
  • If the shift-count vector is reused many times, you can pre-calculate `1<< count_vec` and use `vpmullw` to multiply by the appropriate power of 2. For right shifts, you can do something similar with `vpmulhw`. – Peter Cordes Aug 12 '18 at 05:20
  • If you're using AVX512BW for 16-bit variable-shift, use `vpblendmb` for byte-blends (1 uop), with a mask register of alternating 0 and 1 bits. It's more efficient than AVX2 `vpblendvb` (2 uops). See my comments on https://reviews.llvm.org/D50074. Hopefully at some point LLVM will optimize `_mm256_blendv_epi8` to `vpblendmb` at compile time, especially with a constant mask. – Peter Cordes Aug 12 '18 at 05:41
  • Masking again: you only need one mask: `set1_epi32(0x000000ff)`, which you use *after* shifting instead of before. (Hmm, you might get more ILP by having another mask so an AND can run in parallel with the first shift of the count vector. But at most 2 mask vectors seem like a good idea.) – Peter Cordes Aug 12 '18 at 05:44
  • re: cost of `PBLENDW`: it's single-uop / 1c latency on all AVX2 CPUs (https://agner.org/optimize/). But Intel CPUs only run it on one port (p5) so it could be a throughput bottleneck (again see my comments on that LLVM review). Actually on AMD CPUs, it's 2 uops for the 256-bit version, because as usual they split 256-bit ops. And Bulldozer-family has 2c latency even for the cheapest vector uops. But AMD has good throughput for `pblendw`. – Peter Cordes Aug 12 '18 at 06:29
  • I'm not sure why GCC used `vmovdqa64`, my mistake, thanks (though I think `vmovdqa` is shorter than `vmovdqa64` so the choice is still odd). Good point on the right-shift - I didn't consider that. – Nyan Aug 12 '18 at 06:33
  • Yup, using EVEX for `movdqa` is a silly missed-optimization in gcc. I think I've reported this, but maybe only mentioned it as part of other missed-optimization bug reports. At least it avoids using `vpxord` and other EVEX versions of other instructions. – Peter Cordes Aug 12 '18 at 06:34
  • Added a note about `vpblendmb`, thanks for that. I wanted to stick mostly to AVX2 as that was the original question. – Nyan Aug 12 '18 at 06:37
  • Hmm I think I'm not understanding what part of the code you're referring to exactly. In the second implementation of `_mm256_sllv_epi8`, the ordering of shift/and in the `count` variable is arbitrary and can be re-arranged, before it gets fed into `_mm256_sllv_epi32`. For `_mm256_sllv_epi16`, I don't see how the `and` can be eliminated, because `a< – Nyan Aug 12 '18 at 06:52
  • Oh, I misread your code, I was thinking you were masking the count both ways, and I forgot that you'd need to mask `a`. That's where the "extra AND" is coming from, and it is needed. I don't think there are any wasted instructions after all, in your epi16 or epi8. You are just using `_mm256_srli_epi32(count, 16)` without any extra AND. Deleted my earlier comments. Adding some comments to your source code about which bits are moving where would be a good idea. It might be hard to follow for people who couldn't have answered this question themselves / didn't think through all the gotchas. – Peter Cordes Aug 12 '18 at 07:23
  • Oh I see. I'm not sure where to add comments exactly, as it seems fairly straightforward to *me* (provided you're aware of what the intrinsics do), but then, it *is* my own code so that's not a surprise. Perhaps if someone could add comments around parts they find difficult to understand, I can incorporate it into the answer. – Nyan Aug 12 '18 at 09:45
  • I find it's a good idea to be very verbose in SO answers (and not a bad idea in real SIMD code, although my comments in real code tend to also be about performance tradeoffs). I added some text and a couple comments in your answer. – Peter Cordes Aug 12 '18 at 09:59
  • And BTW, yes, `_mm256_shuffle_epi8` is probably a good option for the epi8 shifts. In-lane shuffles are great. They do compete for port 5 with `vpblendw` on Intel, but probably worth it for the middle 2 elements. (The top and bottom bytes can and should be done with a single shift or a single AND, especially on Skylake where vector shifts have 2-per-clock throughput, up from 1-per-clock on HSW/BDW. Ryzen also has two shift ports, but shuffles run on the same ports.) – Peter Cordes Aug 12 '18 at 10:02