14

The intrinsic _mm_slli_si128 will do a logical shift left of a 128 bit register, but is restricted to immediate shift values, and shifts by bytes not bits.

I can use an intrinsic like _mm_sll_epi64 or _mm_sll_epi32 to shift left a set of values within the __m128i register, but these don't carry the "overflow" bits.

For a shift by N bits imagine that I could do a something like:

  • _mm_sll_epi64
  • _mm_srr_epi64 (for the bits I want to carry: move them into the low order )
  • shuffle the srr result
  • or these together.

(but probably also have to include checks of N relative to 64).

Is there a better way?

Peeter Joot
  • 7,848
  • 7
  • 48
  • 82
  • 1
    I don't think there's any better way. I wrote up an answer to a recent duplicate of this question: http://stackoverflow.com/q/34478328/224132. For compile-time-constant counts, it turns into 4 insns, or 2 insns with count >= 64. With a variable count, it branches and has to `movd` the count and 64-count from integer to vector registers. `__uint128_t` does better in that case, if the data is already in integer registers. – Peter Cordes Dec 27 '15 at 17:04

2 Answers2

5

Not your ideal solution, but if you want to rotate or shift an SSE register by a number of bits that is a multiple of 8, then the PSHUFB instruction (and the _mm_shuffle_epi8() intrinsic) can help. It takes a second SSE register as an input; each byte in the register holds a value that is used to index the bytes in the first input register.

Jason R
  • 11,159
  • 6
  • 50
  • 81
  • 5
    I think the OP stated specifically that he wanted bit-granularity and not restricted to immediates. `_mm_shuffle_epi8()` is both byte-granularity and requires an immediate. – Mysticial Apr 02 '12 at 18:51
  • 5
    I know he wanted bit granularity, hence the first clause in my answer. Also, `_mm_shuffle_epi8()` does not require an immediate; the second argument is an `__m128i` value. [See the documentation here](http://msdn.microsoft.com/en-us/library/bb531427.aspx). – Jason R Apr 02 '12 at 18:57
  • 2
    I should note that this function requires SSSE3 support, which might not be sufficient if you want to run on older machines. – Jason R Apr 02 '12 at 18:59
  • 2
    @Mysticial: Jason is right, `pshufb` with 16 precalculated values can be used for emulating a byte wise shift by a variable. I this case one could use it for a qword wise shift (well yes by 0 or 1 qword ;-)) and the remaining 64 bit shift could be done as the OP proposes. – Gunther Piez Apr 02 '12 at 22:58
4

This came up as a side issue in a blog post (of mine) on unusual C preprocessor uses. For the 127 different shift offsets, there are four different optimal sequences of SSE2 instructions for a bit shift. The preprocessor makes it reasonable to construct a shift function that amounts to a 129-way switch statement. Pardon the raw-code here; I'm unfamiliar with posting code directly here. Check the blog post for an explanation of what's going on.

#include <emmintrin.h>

typedef __m128i XMM;
#define xmbshl(x,n)  _mm_slli_si128(x,n) // xm <<= 8*n  -- BYTE shift left
#define xmbshr(x,n)  _mm_srli_si128(x,n) // xm >>= 8*n  -- BYTE shift right
#define xmshl64(x,n) _mm_slli_epi64(x,n) // xm.hi <<= n, xm.lo <<= n
#define xmshr64(x,n) _mm_srli_epi64(x,n) // xm.hi >>= n, xm.lo >>= n
#define xmand(a,b)   _mm_and_si128(a,b)
#define xmor(a,b)    _mm_or_si128(a,b)
#define xmxor(a,b)   _mm_xor_si128(a,b)
#define xmzero       _mm_setzero_si128()

XMM xm_shl(XMM x, unsigned nbits)
{
    // These macros generate (1,2,5,6) SSE2 instructions, respectively:
    #define F1(n) case 8*(n): x = xmbshl(x, n); break;
    #define F2(n) case n: x = xmshl64(xmbshl(x, (n)>>3), (n)&15); break;
    #define F5(n) case n: x = xmor(xmshl64(x, n), xmshr64(xmbshl(x, 8), 64-(n))); break;
    #define F6(n) case n: x = xmor(xmshl64(xmbshl(x, (n)>>3), (n)&15),\
                                  xmshr64(xmbshl(x, 8+((n)>>3)), 64-((n)&155))); break;
    // These macros expand to 7 or 49 cases each:
    #define DO_7(f,x) f((x)+1) f((x)+2) f((x)+3) f((x)+4) f((x)+5) f((x)+6) f((x)+7)
    #define DO_7x7(f,y) DO_7(f,(y)+1*8) DO_7(f,(y)+2*8) DO_7(f,(y)+3*8) DO_7(f,(y)+4*8) \
                                        DO_7(f,(y)+5*8) DO_7(f,(y)+6*8) DO_7(f,(y)+7*8)
    switch (nbits) {
    case 0: break;
    DO_7(F5, 0) // 1..7
    DO_7(F1, 0) // 8,16,..56
    DO_7(F1, 7) // 64,72,..120
    DO_7x7(F6, 0) // 9..15 17..23 ... 57..63 i.e. [9..63]\[16,24,..,56]
    DO_7x7(F2,56) // 65..71 73..79 ... 121..127 i.e. [65..127]\[64,72,..,120]
    default: x = xmzero;
    }
    return x;
}

xm_shr amounts to the above but swapping "shl" and "shr" everywhere in the F[1256] macros. HTH.

foool
  • 1,462
  • 1
  • 15
  • 29
Mischa
  • 2,240
  • 20
  • 18
  • 2
    Actually, the above code does not work for about half of the shift values. I tested it against a standard shift on 128-bit integers (gcc supports __uint128_t), and the results are markedly different. For example, all shifts above 120 just zero all bits. – seba Nov 09 '15 at 00:10
  • 1
    For compile-time-constant shift count, you never need more than 4 instructions (or 5 without AVX: an extra `movdqa`). For count < 64, byte-shift left by 64b, then bit shift that carry right by 64-count. `OR` the carry with `psllq xmm0, 64`. I wrote it with an `if`, and it compiles away nicely for a compile-time constant count: http://goo.gl/O14GhI. See http://stackoverflow.com/a/34482688/224132 – Peter Cordes Dec 27 '15 at 17:13
  • To fix the code, just replace every &15 or &155 expression by &7. This said, this code is very slow (do you know about branching?!), and Peter Cordes suggestion looks far more promising. – Claude Brisson Jan 03 '16 at 13:36
  • :-) The code originated as an example on stretching the use of the c preprocessor for template-like code with a single switch. Thanks for looking at it in depth. A while back I found that using straight int64's with variable shift counts was faster. But I'll fix up the example code. – Mischa Jan 04 '16 at 16:54