
I need to shift a `__m128i` variable (say, `v`) by `m` bits, in such a way that the bits move across the whole variable (so the resulting value represents `v*2^m`). What is the best way to do this?

Note that `_mm_slli_epi64` shifts `v0` and `v1` separately:

r0 := v0 << count
r1 := v1 << count

so the upper bits of `v0` are lost, but I want to move those bits into `r1`.

Edit: I'm looking for code faster than this (for `m < 64`):

r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);
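
For reference, here is that pseudocode wrapped into a plain-C helper (an untested sketch; it assumes 0 < m < 64 so the 64-m shifts stay in range, and it only produces r0/r1, i.e. the bits shifted past bit 127 are dropped):

#include <stdint.h>

// v1:v0 is the 128-bit value (v0 = low half); shift left by m bits, carrying bits from v0 into v1
void shift128_left(uint64_t *v1, uint64_t *v0, unsigned m)
{
    *v1 = (*v1 << m) | (*v0 >> (64 - m));   // r1
    *v0 = *v0 << m;                         // r0
}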
user0
  • If `m` happens to be a multiple of 8 bits and you have SSSE3, you're in luck: `palignr`. If not, it gets ugly fast and you really, really need to do shifts, ANDs, shuffles and ORs. – Iwillnotexist Idonotexist Dec 27 '15 at 07:09
  • See http://stackoverflow.com/questions/9980801/looking-for-sse-128-bit-shift-operation-for-non-immediate-shift-value – Craig Estey Dec 27 '15 at 07:22
  • Are you processing bit streams, or arithmetic variables (ints, floats, etc)? – bazza Dec 27 '15 at 07:37
  • @user0, the answer I was going to propose won't be of any use, sorry. – bazza Dec 27 '15 at 07:46
  • duplicate of http://stackoverflow.com/questions/9980801/looking-for-sse-128-bit-shift-operation-for-non-immediate-shift-value, but that doesn't have a good answer. – Peter Cordes Dec 27 '15 at 15:06
  • Is your shift count a compile-time constant (after inlining)? You mentioned _mm_slli_epi64. Is your data already in SSE registers? Do you need to keep the bits shifted out the left side of the register? Your C code for the two 64bit halves generates an `r2 = v1 >> (64-m)`, but that won't be part of the `__m128i` result you're asking for. – Peter Cordes Dec 27 '15 at 15:13
  • If you don't have to use SSE, shld+sal is not so bad. – Marc Glisse Dec 27 '15 at 15:21
  • @PeterCordes I already wrote the code with structures. If there is faster code using `__m128i`, I'll recode it with SSE. The above code is pseudocode of what I need. – user0 Dec 27 '15 at 15:26
  • @MarcGlisse This part of the code is repeated many, many times, and a small improvement in speed would be very nice! – user0 Dec 27 '15 at 15:28
  • @user0: I understood that. The question is whether you need `r2`. Do you need your shifts to carry from one `__m128i` to another? Or did you just include `r2` for completeness? – Peter Cordes Dec 27 '15 at 15:50
  • @PeterCordes There is one __m128i. r0 and r1 are segments of the result of shifting. – user0 Dec 27 '15 at 15:52

2 Answers


For compile-time constant shift counts, you can get fairly good results. Otherwise not really.

This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64bits up to the high 64 and use a variable-count shift to put them in the right place.

// untested
#include <immintrin.h>

/* some compilers might choke on slli / srli with non-compile-time-constant args
 * gcc generates the   xmm, imm8 form with constants,
 * and generates the   xmm, xmm  form otherwise (with a movd to get the count into an xmm register).
 */

// doesn't optimize for the special-case where count%8 == 0
// could maybe do that in gcc with if(__builtin_constant_p(count)) { if (count%8 == 0) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
    __m128i carry = _mm_bslli_si128(x, 8);   // old compilers only have the confusingly named _mm_slli_si128 synonym
    if (count >= 64)
        return _mm_slli_epi64(carry, count-64);  // the non-carry part is all zero, so return early
    // else
    carry = _mm_srli_epi64(carry, 64-count);  // the bslli already shifted these bits left by 64; shift right to line up the carry bits

    x = _mm_slli_epi64(x, count);
    return _mm_or_si128(x, carry);
}

__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
    return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100);  }
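
As the comment in the code notes, it doesn't exploit one easy special case: if the count is a compile-time constant that happens to be a multiple of 8, the whole shift collapses to a single byte-shift of the register. A minimal, untested sketch of that, using a hypothetical 24-bit example:

// hypothetical constant-count special case: 24 bits = 3 bytes, a single pslldq
__m128i mm_bitshift_left_24(__m128i x) { return _mm_slli_si128(x, 3); }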

I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from an integer reg to an xmm reg). There is a _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i in the name), but at least these days, the i intrinsic can generate either form of psllq.
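
If you did want to spell out the xmm-count form by hand, it would look roughly like the sketch below (just an illustration of what the compilers generate for a runtime count, not something you normally need to write):

// roughly what gcc/clang emit for _mm_slli_epi64(x, count) with a runtime count:
// movd the count into an xmm register, then use psllq's xmm,xmm form
__m128i sll_epi64_runtime_count(__m128i x, unsigned count)
{
    __m128i vcount = _mm_cvtsi32_si128((int)count);   // movd: count -> low dword of an xmm reg
    return _mm_sll_epi64(x, vcount);                  // psllq xmm, xmm
}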


The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):

mm_bitshift_left_3(long long __vector(2)):
        vpslldq xmm1, xmm0, 8
        vpsrlq  xmm1, xmm1, 61
        vpsllq  xmm0, xmm0, 3
        vpor    xmm0, xmm0, xmm1
        ret

Performance:

This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). The byte-shift runs on the shuffle unit on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uops with 2 cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)

Or for counts >= 64:

mm_bitshift_left_100(long long __vector(2)):
        vpslldq xmm0, xmm0, 8
        vpsllq  xmm0, xmm0, 36
        ret

If your shift count is not a compile-time constant, you have to branch on `count >= 64` (as the code above does) to figure out whether to left- or right-shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.

It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
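
One branchless possibility (my own untested sketch, not something from the code above): instead of a compare+blend, lean on the fact that psllq/psrlq produce all-zeros when the count in the xmm register is 64 or greater, so the "wrong" path contributes nothing and the two carry paths can simply be ORed together:

// untested sketch: fully variable count 0..127, no branch
// out-of-range counts (>= 64) make the corresponding psllq/psrlq result zero
// note: when 64 - (int)count or (int)count - 64 is negative, movd turns it into a
// huge unsigned 64-bit count, which also shifts the element to zero
__m128i mm_bitshift_left_nobranch(__m128i x, unsigned count)
{
    __m128i carry = _mm_bslli_si128(x, 8);   // low half moved up to the high half
    __m128i lo = _mm_srl_epi64(carry, _mm_cvtsi32_si128(64 - (int)count));  // used when count < 64, else zero
    __m128i hi = _mm_sll_epi64(carry, _mm_cvtsi32_si128((int)count - 64));  // used when count >= 64, else zero
    x = _mm_sll_epi64(x, _mm_cvtsi32_si128((int)count));                    // low half; zero when count >= 64
    return _mm_or_si128(x, _mm_or_si128(lo, hi));
}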


The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)

__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
    return x << count;  // undefined if count >= 128
}
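
If the value does start life in a vector register and you want to try the GP-register route anyway, a store/reload round trip like the sketch below would do it (my own illustration, untested); whether that round trip pays off is exactly the "is your data already in SSE registers?" question from the comments.

// untested sketch (GNU C, relies on the __uint128_t extension): bounce through memory/GP regs
__m128i mm_bitshift_left_via_uint128(__m128i x, unsigned count)
{
    __uint128_t v;
    _mm_storeu_si128((__m128i*)&v, x);            // vector -> memory -> GP regs
    v <<= count;                                  // still undefined if count >= 128
    return _mm_loadu_si128((const __m128i*)&v);   // back into a vector register
}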
Peter Cordes
  • Thank you very much. Unfortunately `count` is not a compile-time constant. However I'll test both suggestions. – user0 Dec 28 '15 at 15:16
  • According to my tests, my old code written with 4 `int64_t` vars is faster (>2 times) for randomly generated `count`; but for a compile-time constant `count`, `mm_bitshift_left` is at least 1.5 times faster. – user0 Dec 28 '15 at 15:19
  • @user0: I'm not surprised. In a real app, I'd expect there to be a little bit of predictability in shift counts. Also, did your microbench test *just* the shift, or did it test the shift as an operation in between two other vector intrinsics? In that case, the `int64_t` shift would have to get the values from vector to GP regs and back. (I think I said in my answer that if your data isn't already in vector regs, `__uint128` shift (or its hand-written equivalent with `int64_t`) should do well.) – Peter Cordes Dec 28 '15 at 18:37
  • Time was measured just for the shifting operations. I'm going to test `__uint128`. – user0 Dec 29 '15 at 06:29
  • Yes! `__uint128` is faster than the others. It is at least 1.5 times faster than the `int64_t` method for a random `count`. But it seems some machines don't support 128-bit integers. – user0 Dec 29 '15 at 06:46
  • @user0: It's a compiler extension. When you say "some machines", you mean some compilers, not some target machines. The machine instructions emitted when compiling code that uses `__uint128_t` are just standard add-with-carry, double-shift, etc. that are baseline for x86. – Peter Cordes Dec 29 '15 at 15:50

In SSE4a, the instructions `insertq` and `extrq` can be used to shift (and rotate) through an `__m128i` 1-64 bits at a time. Unlike the 8/16/32/64-bit counterparts pextrN/pinsrN, these instructions select or insert m bits (between 1 and 64) at any bit offset from 0 to 127. The caveat is that the sum of length and offset must not exceed 128.

Aki Suihkonen