2

I came across this post whilst doing research for my next project. Being able to bit-shift 8- and 16-bit integers by a vector of per-element shift counts using SIMD would be very useful to me, and I think to many other people here.

Unfortunately for me, the platform my project will be running on will have at most SSE2 capabilities.

Swapping the `_mm256_*` intrinsics for their `_mm_*` counterparts is not gonna cut it, as:

 _mm_shuffle_epi8() //Requires SSSE3 
 _mm_blendv_epi8()  //Requires SSE4.1
 _mm_blend_epi16()  //Requires SSE4.1
 _mm_sllv_epi32()   //Requires AVX2

So you see my dilemma. It may be impossible to achieve with just SSE2, but I would be very happy (and frankly amazed) to be proven wrong.

Thanks in advance.

dave_thenerd
  • SSSE3 `pshufb` (`_mm_shuffle_epi8`) is a pretty fundamental building block. Replacing its functionality as a parallel lookup table isn't efficiently possible in general, and I don't know any obvious way to generate `1<<n` without it. – Peter Cordes Oct 13 '22 at 04:31
  • The majority of systems without AVX have at least SSSE3, especially x86-64, although there are presumably a few AMD Phenom II systems still in operation. Possibly even Nocona P4. Intel's first x86-64 after Pentium 4 was Core 2, with SSSE3 (which has slow pshufb, but not hugely slow; probably still better than falling back to scalar). If you have a few different shift counts, you can `_mm_slli_epi16` and manually blend (andnot/and/or). But for that to be non-terrible, you might have to JIT a sequence of instructions and masks for a specific set of shift counts. – Peter Cordes Oct 13 '22 at 04:34
  • In some cases you can simulate `_mm_sllv_epi32` by converting to `float` (or `double`), adding something to the exponent and converting back to `int32` (a sketch of that idea follows these comments). This won't be efficient if you need to cover all cases (e.g. inputs with more than 23 significant bits, or results which overflow). – chtz Oct 13 '22 at 06:56
  • `_mm_sra*` and `_mm_srl*`, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE2 – Hans Passant Oct 13 '22 at 16:05
  • @HansPassant: Are you talking about `_mm_sll_epi16(vec, vec)` which uses the low 64 bits of the 2nd vec as the count for each element? The OP linked another Q&A which asks about emulating AVX-512 `_mm256_sllv_epi16`, so it seems their "by vector" means with a different count for every element. (Unless they didn't know about those single-count-multiple-data shifts.) – Peter Cordes Oct 15 '22 at 13:04
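
To make chtz's suggestion above concrete, here is a minimal sketch of the float-exponent trick for 32-bit lanes. This is not from the question or the answers, and it only holds when the inputs and the shifted results survive the float round trip (at most 24 significant bits) and the counts are in [0, 31]:

#include <emmintrin.h> // SSE2

// emulate _mm_sllv_epi32 on SSE2 by adding the shift count to the FP32 exponent
static inline __m128i sllv_epi32_sse2(__m128i v, __m128i count) {
    __m128i f = _mm_castps_si128(_mm_cvtepi32_ps(v)); // value as float bit pattern
    f = _mm_add_epi32(f, _mm_slli_epi32(count, 23));  // add count to the exponent field
    return _mm_cvttps_epi32(_mm_castsi128_ps(f));     // truncate back to int32
}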

3 Answers

4

Not the nicest code going, and I can't really say if it's better or worse than processing each element as a scalar uint16. You could save a few ops if you ensure the bit shift amount is always < 16, but it's still not great.

__m128i sllv_epi16(__m128i v, __m128i s) {

    // test each bit in the shift amount
    const __m128i _1  = _mm_set1_epi16(1);
    const __m128i _2  = _mm_set1_epi16(2);
    const __m128i _4  = _mm_set1_epi16(4);
    const __m128i _8  = _mm_set1_epi16(8);

    // clamp to 16 so that shifts of 16 or greater can be detected and zeroed at the end
    const __m128i _16 = _mm_set1_epi16(16);
    s = _mm_min_epi16(s, _16);

    // mask out each bit in the shift amount
    __m128i cmp1  = _mm_and_si128(s, _1);
    __m128i cmp2  = _mm_and_si128(s, _2);
    __m128i cmp4  = _mm_and_si128(s, _4);
    __m128i cmp8  = _mm_and_si128(s, _8);
    __m128i cmp16 = _mm_cmpeq_epi16(_16, s);

    // convert each bit into a true/false mask
    cmp1 = _mm_cmpeq_epi16(_1, cmp1);
    cmp2 = _mm_cmpeq_epi16(_2, cmp2);
    cmp4 = _mm_cmpeq_epi16(_4, cmp4);
    cmp8 = _mm_cmpeq_epi16(_8, cmp8);

    // shift by 1 bit, select result
    __m128i shift1 = _mm_slli_epi16(v, 1);
    v = _mm_or_si128(_mm_andnot_si128(cmp1, v), 
                     _mm_and_si128(cmp1, shift1));

    // shift by 2 bits, select result
    __m128i shift2 = _mm_slli_epi16(v, 2);
    v = _mm_or_si128(_mm_andnot_si128(cmp2, v),
                     _mm_and_si128(cmp2, shift2));

    // shift by 4 bits, select result
    __m128i shift4 = _mm_slli_epi16(v, 4);
    v = _mm_or_si128(_mm_andnot_si128(cmp4, v),
                     _mm_and_si128(cmp4, shift4));

    // shift by 8 bits, select result
    __m128i shift8 = _mm_slli_epi16(v, 8);
    v = _mm_or_si128(_mm_andnot_si128(cmp8, v),
                     _mm_and_si128(cmp8, shift8));

    // filter out shifts >= 16.
    return _mm_andnot_si128(cmp16, v); 
}

and for 8-bit:

__m128i sllv_epi8(__m128i v, __m128i s) {
    
    const __m128i _1 = _mm_set1_epi8(1);
    const __m128i _2 = _mm_set1_epi8(2);
    const __m128i _4 = _mm_set1_epi8(4);
    const __m128i _8 = _mm_set1_epi8(8);
    // clamp to 8 so that shifts of 8 or greater can be detected and zeroed at the end
    s = _mm_min_epu8(s, _8);

    __m128i cmp1 = _mm_and_si128(s, _1);
    __m128i cmp2 = _mm_and_si128(s, _2);
    __m128i cmp4 = _mm_and_si128(s, _4);
    __m128i cmp8 = _mm_cmpeq_epi8(_8, s);

    cmp1 = _mm_cmpeq_epi8(_1, cmp1);
    cmp2 = _mm_cmpeq_epi8(_2, cmp2);
    cmp4 = _mm_cmpeq_epi8(_4, cmp4);

    // SSE2 has no 8-bit shift: mask off the bits that would spill into the
    // neighbouring byte, then use the 16-bit shift
    __m128i shift1 = _mm_slli_epi16(_mm_and_si128(v, _mm_set1_epi8(0x7F)), 1);
    v = _mm_or_si128(_mm_andnot_si128(cmp1, v), 
                     _mm_and_si128(cmp1, shift1));

    __m128i shift2 = _mm_slli_epi16(_mm_and_si128(v, _mm_set1_epi8(0x3F)), 2);
    v = _mm_or_si128(_mm_andnot_si128(cmp2, v),
                     _mm_and_si128(cmp2, shift2));

    __m128i shift4 = _mm_slli_epi16(_mm_and_si128(v, _mm_set1_epi8(0x0F)), 4);
    v = _mm_or_si128(_mm_andnot_si128(cmp4, v),
                     _mm_and_si128(cmp4, shift4));

    return _mm_andnot_si128(cmp8, v); 
}
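
Not part of the answer, but a quick sanity check of the 8-bit version against a scalar loop (counts of 8 or more are expected to produce zero):

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

static void check_sllv_epi8(void) {
    uint8_t v[16], s[16], expect[16], got[16];
    for (int i = 0; i < 16; ++i) {
        v[i] = (uint8_t)(i * 37 + 1);
        s[i] = (uint8_t)(i % 10);                       // includes counts >= 8
        expect[i] = (s[i] >= 8) ? 0 : (uint8_t)(v[i] << s[i]);
    }
    __m128i r = sllv_epi8(_mm_loadu_si128((const __m128i*)v),
                          _mm_loadu_si128((const __m128i*)s));
    _mm_storeu_si128((__m128i*)got, r);
    assert(memcmp(expect, got, sizeof got) == 0);
}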
robthebloke
  • For `sllv_epi8` you can save 1 uop by using `shift1 = _mm_add_epi8(v,v)` and in `sllv_epi16` you probably want to use `s = _mm_min_epu16(s, _16);` (instead of `_epi16`). I think for 16bit you can get better throughput using float conversion (I'll post an answer later), but I assume overall, these should at least be better than storing to memory, looping and reading back from memory. – chtz Oct 13 '22 at 19:40
  • Ok, `_mm_min_epu16` is only available with SSE4.1, and apparently Soonts was faster with writing a FP-based solution :) – chtz Oct 13 '22 at 19:51
3

Here’s another approach for uint16_t lanes. The latency is probably worse than the answer by robthebloke, because the instructions which convert int32 <-> fp32 take 3 (AMD) or 4 (Intel) cycles on modern CPUs, and the function has two of them on the dependency chain.

But throughput might be slightly better, as there are fewer instructions to run.

// Shift int16_t lanes left or right, while shifting in zeros
template<bool leftShift, bool validateShiftAmount = true>
inline __m128i shiftLeftRight_epi16( __m128i vec, __m128i shift )
{
    if constexpr( validateShiftAmount )
    {
        shift = _mm_max_epi16( shift, _mm_setzero_si128() );
        shift = _mm_min_epi16( shift, _mm_set1_epi16( 16 ) );
    }

    // Unpack uint16_t lanes into uint32_t, even/odd lanes in 2 vectors
    const __m128i lowMask = _mm_set1_epi32( 0xFFFF );
    __m128i low = _mm_and_si128( vec, lowMask );
    __m128i high = _mm_srli_epi32( vec, 16 );

    // Convert both numbers to FP32
    low = _mm_castps_si128( _mm_cvtepi32_ps( low ) );
    high = _mm_castps_si128( _mm_cvtepi32_ps( high ) );

    // Unpack uint16_t lanes with shift amount, in the exponent field
    __m128i shiftHigh = _mm_andnot_si128( lowMask, shift );
    __m128i shiftLow = _mm_slli_epi32( shift, 23 );
    shiftHigh = _mm_slli_epi32( shiftHigh, 23 - 16 );

    // Apply offset to the FP32 exponent
    if constexpr( leftShift )
    {
        low = _mm_add_epi32( low, shiftLow );
        high = _mm_add_epi32( high, shiftHigh );
    }
    else
    {
        low = _mm_sub_epi32( low, shiftLow );
        high = _mm_sub_epi32( high, shiftHigh );
    }

    // Convert numbers back to integers;
    // cvttps2dq truncates to zero, ignoring MXCSR rounding modes
    low = _mm_cvttps_epi32( _mm_castsi128_ps( low ) );
    high = _mm_cvttps_epi32( _mm_castsi128_ps( high ) );

    // Assemble the complete vector from the two pieces
    low = _mm_and_si128( low, lowMask );
    high = _mm_slli_epi32( high, 16 );
    return _mm_or_si128( low, high );
}

inline __m128i sllv_epi16( __m128i vec, __m128i shift )
{
    return shiftLeftRight_epi16<true>( vec, shift );
}
inline __m128i srlv_epi16( __m128i vec, __m128i shift )
{
    return shiftLeftRight_epi16<false>( vec, shift );
}
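
For illustration (not from the answer), a call with per-lane counts and the result it should produce:

__m128i v = _mm_setr_epi16( 1, 2, 3, 4, 5, 6, 7, 8 );
__m128i c = _mm_setr_epi16( 0, 1, 2, 3, 4, 5, 6, 7 );
__m128i r = sllv_epi16( v, c );
// r = { 1, 4, 12, 32, 80, 192, 448, 1024 }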

As for 8-bit lanes: while it's possible to reduce them to two shifts of two vectors of 16-bit lanes, I think that's going to be too many instructions to run. For that use case, I would probably use the version in the other answer.

Soonts
2

Variable bit shifts of 16-bit values can be done quite easily by multiplication: for a left shift it's `_mm_mullo_epi16(input, one_hot(bits))`, and for a right shift it's `_mm_mulhi_epu16(input, one_hot(16 - bits))`.

On SSSE3, `one_hot` would optimally use `pshufb` to handle the low 3 bits of the shift count; then we would only need one post-shift by 8 when bit 3 of the count is set -- and here the vector of shift counts would optimally be `uint8_t`.
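
The answer leaves that lookup abstract; one possible shape for it (my sketch, assuming 16-bit shift counts in [0, 15], and folding the count >= 8 case into the table instead of a post-shift by 8):

__m128i sllv_epi16_ssse3(__m128i v, __m128i count) {
    // lut[i] = (i < 8) ? 1 << i : 0
    const __m128i lut = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, -128,
                                      0, 0, 0, 0, 0, 0, 0, 0);
    // low byte of each lane indexes with count, high byte with count ^ 8,
    // so lanes with count >= 8 pick up their power of two in the high byte
    __m128i idx = _mm_or_si128(count, _mm_slli_epi16(count, 8));
    idx = _mm_xor_si128(idx, _mm_set1_epi16(0x0800));
    __m128i one_hot = _mm_shuffle_epi8(lut, idx);       // one_hot = 1 << count
    return _mm_mullo_epi16(v, one_hot);
}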

On SSE2 we seem to have a chicken-and-egg problem; but with multiplication we can get away with slightly better/fewer constants, and we get a shorter dependency chain.

// as long as we have an even number of multiplies, we
// can as well multiply by negative values:
// a *= (mask & 1 ? -2 : -1) * (mask & 2 ? -4 : -1) *
//      (mask & 4 ? -16 : -1) * (mask & 8 ? -256 : -1);

__m128i product_1 = generate_1_or_2(shift_vec);
__m128i product_2 = generate_1_or_4(shift_vec);
__m128i product_4 = generate_1_or_16(shift_vec);
__m128i product_8 = generate_1_or_256(shift_vec);

__m128i p12 = _mm_mullo_epi16(product_1, product_2);
__m128i p48 = _mm_mullo_epi16(product_4, product_8);
__m128i p1248 = _mm_mullo_epi16(p12, p48);
return _mm_mullo_epi16(a, p1248);
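
The `generate_1_or_N` helpers are left abstract above; a possible SSE2 shape for them (my sketch, not from the answer) broadcasts one bit of the shift count and turns it into either 1 or the corresponding power of two:

// bit N of the count selects between 1 and 1 << (1 << N)
template <int N>
__m128i generate_1_or_pow2(__m128i count) {
    __m128i m = _mm_srai_epi16(_mm_slli_epi16(count, 15 - N), 15); // 0 or -1
    m = _mm_and_si128(m, _mm_set1_epi16((1 << (1 << N)) - 1));     // 0 or 2^(2^N) - 1
    return _mm_add_epi16(m, _mm_set1_epi16(1));                    // 1 or 2^(2^N)
}

Here `generate_1_or_2(s)` would correspond to `generate_1_or_pow2<0>(s)`, `generate_1_or_4(s)` to `generate_1_or_pow2<1>(s)`, and so on.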

Having multiple independent products, and due to the commutativity of multiplication, we can choose either to multiply the input or to multiply some previous product.

We can also premultiply the input vector by one of the constants, as in

__m128i p1 = _mm_srai_epi16(_mm_slli_epi16(shift_vec, 15), 15); // broadcast bit 0
p1 = _mm_add_epi16(_mm_and_si128(p1, vec), vec);                // vec or 2*vec

__m128i p2 = generate_1_or_4(shift_vec);
__m128i p4 = generate_1_or_16(shift_vec);
__m128i p8 = generate_1_or_256(shift_vec);

return _mm_mullo_epi16(_mm_mullo_epi16(p1, p2), _mm_mullo_epi16(p4, p8));

which would have only 2 multiplications on the critical path.

It's also possible to have an even number of those constants negative, if those constants are easier to generate.

template <int N>
__m128i generate_minus_1_or_mask(__m128i a) {
    // select -(1 << (1 << N)) if bit N of the shift count is set, else -1
    a = _mm_xor_si128(a, _mm_set1_epi16(-1));   // invert (shared between instances)
    a = _mm_slli_epi16(a, 15 - N);              // move inverted bit N to the sign bit
    a = _mm_srai_epi16(a, 15);                  // 0 if bit N was set, -1 otherwise
    return _mm_or_si128(a, _mm_set1_epi16(-(1 << (1 << N))));
}

The inversion should be shared between all the instances, and the rest should give just three instructions per instance (the last one being a `por xmm0, xmmword ptr [rip + .LCPI0_0]`).
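
Putting that together (again only a sketch, for shift counts in [0, 15]): the four negative factors multiply to a positive `1 << count`, so the whole variable left shift becomes

__m128i sllv_epi16_neg(__m128i v, __m128i count) {
    __m128i m0 = generate_minus_1_or_mask<0>(count);  // -2   or -1
    __m128i m1 = generate_minus_1_or_mask<1>(count);  // -4   or -1
    __m128i m2 = generate_minus_1_or_mask<2>(count);  // -16  or -1
    __m128i m3 = generate_minus_1_or_mask<3>(count);  // -256 or -1
    __m128i p  = _mm_mullo_epi16(_mm_mullo_epi16(m0, m1),
                                 _mm_mullo_epi16(m2, m3)); // = 1 << count
    return _mm_mullo_epi16(v, p);
}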

Aki Suihkonen
  • Very good idea, especially if the shift vector actually is known at compile time. But I don't think `_mm_or_si128(shift_inv >> 1, _mm_set1_epi16(-2)) << 1` etc. are correct; you will get `~mask & 1 ? -2 : -4` instead of `-1 : -4` (shift left does not fill with 1s). You could shift the inverted mask to the left by 15, 14, etc., then do an arithmetic shift by 15 and then bit-or it with -2, -4, -16, etc. – chtz Oct 14 '22 at 08:58
  • Right you are. Even if they produce the proper values of -2, -4, -16, -256, they don't produce the corresponding values of -1,-1,-1,-1. – Aki Suihkonen Oct 14 '22 at 14:43
  • `p1 = shift_inv | _mm_set1_epi16(-2)` and `p2 = _mm_srai_epi16(_mm_slli_epi16(shift_inv, 14), 15) | _mm_set1_epi16(-4)` should work (`p4 = ((s<<13) >> 15) | -16; p8 = ((s<<12) >> 15) | -256;`) – chtz Oct 14 '22 at 16:20
  • 1
    If you wanted a positive `p1`, then `p1 = (shift_vec & 1) + 1` would also be just 2 uops. – chtz Oct 14 '22 at 16:27