
I have a __m128i variable and I need to shift its 128-bit value by n bits, i.e. the way _mm_srli_si128 and _mm_slli_si128 work, but by bits instead of bytes. What is the most efficient way of doing this?

Paul R
Filippo Bistaffa
  • possible duplicate of [How to perform element-wise left shift with \_\_m128i?](http://stackoverflow.com/questions/11148833/how-to-perform-element-wise-left-shift-with-m128i) – Salgar Jul 12 '13 at 08:36
  • @Salgar: no, that's not a duplicate - this question is about shifting an entire 128 bit vector whereas the question you cite as a dupe is about element-wise shifting. – Paul R Jul 12 '13 at 09:10

1 Answer


This is the best that I could come up with for left/right immediate shifts with SSE2:

#include <stdio.h>
#include <emmintrin.h>

#define SHL128(v, n) \
({ \
    __m128i v1, v2; \
 \
    if ((n) >= 64) \
    { \
        v1 = _mm_slli_si128(v, 8); \
        v1 = _mm_slli_epi64(v1, (n) - 64); \
    } \
    else \
    { \
        v1 = _mm_slli_epi64(v, n); \
        v2 = _mm_slli_si128(v, 8); \
        v2 = _mm_srli_epi64(v2, 64 - (n)); \
        v1 = _mm_or_si128(v1, v2); \
    } \
    v1; \
})

#define SHR128(v, n) \
({ \
    __m128i v1, v2; \
 \
    if ((n) >= 64) \
    { \
        v1 = _mm_srli_si128(v, 8); \
        v1 = _mm_srli_epi64(v1, (n) - 64); \
    } \
    else \
    { \
        v1 = _mm_srli_epi64(v, n); \
        v2 = _mm_srli_si128(v, 8); \
        v2 = _mm_slli_epi64(v2, 64 - (n)); \
        v1 = _mm_or_si128(v1, v2); \
    } \
    v1; \
})

int main(void)
{
    __m128i va = _mm_setr_epi8(0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f);
    __m128i vb, vc;

    vb = SHL128(va, 4);
    vc = SHR128(va, 4);

    printf("va = %02vx\n", va);
    printf("vb = %02vx\n", vb);
    printf("vc = %02vx\n", vc);
    printf("\n");

    vb = SHL128(va, 68);
    vc = SHR128(va, 68);

    printf("va = %02vx\n", va);
    printf("vb = %02vx\n", vb);
    printf("vc = %02vx\n", vc);

    return 0;
}

Test:

$ gcc -Wall -msse2 shift128.c && ./a.out
va = 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
vb = 00 10 20 30 40 50 60 70 80 90 a0 b0 c0 d0 e0 f0
vc = 10 20 30 40 50 60 70 80 90 a0 b0 c0 d0 e0 f0 00

va = 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
vb = 00 00 00 00 00 00 00 00 00 10 20 30 40 50 60 70
vc = 90 a0 b0 c0 d0 e0 f0 00 00 00 00 00 00 00 00 00
$ 

Note that the SHL128/SHR128 macros are implemented using a gcc extension (statement expressions, i.e. the `({ ... })` construct) which is supported by gcc, clang and some other compilers. The macros will need to be adapted if your compiler does not support this extension.
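For compilers without statement expressions, one possible adaptation is to use plain inline functions instead of macros. The following is a minimal sketch rather than part of the tested answer: the names `shl128_var` and `shr128_var` are hypothetical, and it swaps the immediate-form `_mm_slli_epi64`/`_mm_srli_epi64` for `_mm_sll_epi64`/`_mm_srl_epi64`, which take the shift count in a vector register, so that `n` can be an ordinary function argument. It assumes 0 <= n <= 127.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: shift the whole 128-bit value left by n bits.
   Assumes 0 <= n <= 127. */
static inline __m128i shl128_var(__m128i v, int n)
{
    /* Move the low 64-bit half into the high half; the low half becomes 0. */
    __m128i moved = _mm_slli_si128(v, 8);

    if (n >= 64)
        return _mm_sll_epi64(moved, _mm_cvtsi32_si128(n - 64));

    /* Shift each 64-bit half, then OR in the bits that cross the boundary.
       A count of 64 (when n == 0) yields zero, so no special case is needed. */
    __m128i shifted = _mm_sll_epi64(v, _mm_cvtsi32_si128(n));
    __m128i carry   = _mm_srl_epi64(moved, _mm_cvtsi32_si128(64 - n));
    return _mm_or_si128(shifted, carry);
}

/* Hypothetical helper: shift the whole 128-bit value right by n bits
   (logical shift). Assumes 0 <= n <= 127. */
static inline __m128i shr128_var(__m128i v, int n)
{
    /* Move the high 64-bit half into the low half; the high half becomes 0. */
    __m128i moved = _mm_srli_si128(v, 8);

    if (n >= 64)
        return _mm_srl_epi64(moved, _mm_cvtsi32_si128(n - 64));

    __m128i shifted = _mm_srl_epi64(v, _mm_cvtsi32_si128(n));
    __m128i carry   = _mm_sll_epi64(moved, _mm_cvtsi32_si128(64 - n));
    return _mm_or_si128(shifted, carry);
}
```

Called as `shl128_var(va, 4)` or `shr128_var(va, 68)`, these should give the same results as the macros. The trade-off is that when `n` is not known at compile time, the branch and the count-register setup stay in the generated code.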

Note also that the printf extension for SIMD types used in the test harness (the `v` modifier in `%02vx`) works with Apple gcc, clang, et al., but if your compiler does not support it and you want to test the code, you will need to implement your own SIMD print routines.
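If the vector printf extension is not available, a print routine along the following lines could be used instead. This is only a sketch: `m128i_to_hex` and `print_m128i` are hypothetical names. It stores the vector to a byte array and formats the 16 bytes in memory order, which matches the layout of the output shown above.

```c
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: format v as 16 space-separated hex bytes
   ("00 01 ... 0f"), lowest-addressed byte first.
   out must hold 48 chars: 16 * 2 digits + 15 spaces + terminating NUL. */
static void m128i_to_hex(__m128i v, char out[48])
{
    static const char digits[] = "0123456789abcdef";
    uint8_t b[16];
    int i;

    _mm_storeu_si128((__m128i *)b, v);  /* unaligned store, safe for any buffer */
    for (i = 0; i < 16; i++) {
        out[3 * i]     = digits[b[i] >> 4];
        out[3 * i + 1] = digits[b[i] & 0x0f];
        out[3 * i + 2] = (i < 15) ? ' ' : '\0';
    }
}

/* Hypothetical helper: print "name = xx xx ... xx" followed by a newline. */
static void print_m128i(const char *name, __m128i v)
{
    char buf[48];
    m128i_to_hex(v, buf);
    printf("%s = %s\n", name, buf);
}
```

With this in place, a line such as `printf("vb = %02vx\n", vb);` in the test harness could be replaced by `print_m128i("vb", vb);`.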

Note on performance: the if/else branch will get optimised out so long as n is a compile-time constant (which it needs to be anyway for the immediate-form shift intrinsics used here), so you have 2 instructions for the n >= 64 case and 4 instructions for the n < 64 case.

Paul R
  • You can just look at the values in your debugger, or you can write some debug print utilities for SIMD data types - they will probably be useful if you plan to get involved with writing SSE code. – Paul R Jul 12 '13 at 13:17
  • I'd like to share with you my version for n > 64 bits: `#define SHL(v, n) ({ \ register __m128i x; register int m; \ if (n > 64) { x = _mm_slli_si128(v, 8); \ m = n - 64; } else { x = v; m = n; } \ register __m128i v1 = _mm_slli_epi64(x, m); \ register __m128i v2 = _mm_slli_si128(x, 8); \ v2 = _mm_srli_epi64(v2, 64 - (m)); \ v1 = _mm_or_si128(v1, v2); \ })` Can it be improved? – Filippo Bistaffa Jul 14 '13 at 10:00
  • It's not easy to read code in a comment - maybe you could post this as a new question? – Paul R Jul 14 '13 at 12:55
  • OK - that's not very efficient - you do more operations than are necessary when n > 64. – Paul R Jul 14 '13 at 13:53
  • So how would you improve that? – Filippo Bistaffa Jul 14 '13 at 17:24
  • OK - I've updated the answer to handle shifts > 64 bits (see above). – Paul R Jul 14 '13 at 19:58
  • Result of `SHL128(va, 2);` looks correct, but results of others `SHR128(va, 4);`, `SHL128(va, 68);`, `SHR128(va, 68);` - look wrong: http://coliru.stacked-crooked.com/a/9725c1f7912efbc0 – Alex Feb 17 '18 at 11:36
  • @Alex: could it be a problem with your `m128i_to_bitstr` function? You're using a `uint16_t *` to index into the `__m128i` but treating each 16 bit element as a `std::bitset<8>`. Maybe you should be indexing this as 16 x `uint8_t`? (I haven't looked at this properly yet so apologies if I missed something.) – Paul R Feb 19 '18 at 10:26
  • @Paul R Thanks! Yes, your shift-code is correct, there was an error in my print code, now everything works correctly: http://coliru.stacked-crooked.com/a/a9dbd34a8940082b – Alex Feb 19 '18 at 11:14