
I've been trying to implement a per-element shift ("shift by vector") with SSE2 intrinsics, but from experimentation and the Intel intrinsics guide, it appears to use only the least-significant element of the count vector.

To reword my question, given a vector {v1, v2, ..., vn} and a set of shifts {s1, s2, ..., sn}, how do I calculate a result {r1, r2, ..., rn} such that:

r1 = v1 << s1
r2 = v2 << s2
...
rn = vn << sn

since it appears that _mm_sll_epi* performs this:

r1 = v1 << s1
r2 = v2 << s1
...
rn = vn << s1
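For example, the single-count behaviour can be checked with a snippet like this (my illustration, not part of the original question):

```cpp
#include <emmintrin.h>
#include <cstdint>

// _mm_sll_epi64 shifts *both* 64-bit elements by the count held in the
// low 64 bits of the second operand; the high count element is ignored.
void sll_demo(uint64_t out[2])
{
    __m128i v = _mm_set_epi64x(8, 7);  // {7, 8} (low element listed first)
    __m128i s = _mm_set_epi64x(5, 4);  // intended counts {4, 5}; only the 4 is used
    _mm_storeu_si128((__m128i*)out, _mm_sll_epi64(v, s));
    // out[0] == 7 << 4 == 112, out[1] == 8 << 4 == 128 (not 8 << 5 == 256)
}
```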

Thanks in advance.

EDIT:

Here's the code I have:

#include <iostream>

#include <cstdint>

#include <mmintrin.h>
#include <emmintrin.h>

namespace SIMD {

    using namespace std;

    class SSE2 {
    public:
        // flipped operands due to function arguments
        SSE2(uint64_t a, uint64_t b, uint64_t c, uint64_t d) { low = _mm_set_epi64x(b, a); high = _mm_set_epi64x(d, c); }

        uint64_t& operator[](int idx)
        {
            switch (idx) {
            case 0:
                _mm_storel_epi64((__m128i*)result, low);
                return result[0];
            case 1:
                _mm_store_si128((__m128i*)result, low);
                return result[1];
            case 2:
                _mm_storel_epi64((__m128i*)result, high);
                return result[0];
            case 3:
                _mm_store_si128((__m128i*)result, high);
                return result[1];
            }

            /* Unreachable for idx in [0, 3].  Returning a reference to a
               literal (`return 0;`) doesn't compile for a uint64_t&, so
               fall back to element 0. */
            return result[0];
        }

        SSE2& operator<<=(const SSE2& rhs)
        {
            low  = _mm_sll_epi64(low,  rhs.getlow());
            high = _mm_sll_epi64(high, rhs.gethigh());

            return *this;
        }

        void print()
        {
            uint64_t a[2];
            _mm_store_si128((__m128i*)a, low);

            cout << hex;
            cout << a[0] << ' ' << a[1] << ' ';

            _mm_store_si128((__m128i*)a, high);

            cout << a[0] << ' ' << a[1] << ' ';
            cout << dec;
        }

        __m128i getlow() const
        {
            return low;
        }

        __m128i gethigh() const
        {
            return high;
        }
    private:
        __m128i low, high;
        uint64_t result[2];
    };
}

int main()
{
    using namespace std;

    cout << "operator<<= test: vector << vector: ";
    {
        auto x = SIMD::SSE2(7, 8, 15, 10);
        auto y = SIMD::SSE2(4, 5,  6,  7);

        x.print();
        y.print();

        x <<= y;

        if (x[0] != 112 || x[1] != 256 || x[2] != 960 || x[3] != 1280) {
            cout << "FAILED: ";
            x.print();
            cout << endl;
        } else {
            cout << "PASSED" << endl;
        }
    }

    return 0;
}

What should happen gives results of {7 << 4 = 112, 8 << 5 = 256, 15 << 6 = 960, 10 << 7 = 1280}. The results seem to be {7 << 4 = 112, 8 << 4 = 128, 15 << 6 = 960, 10 << 6 = 640}, which isn't what I want.

Hope this helps, Jens.

ZirconiumX
  • Please, please, how is anybody supposed to know what you are doing if you don't give us the declarations of your variables? Please read the help text in the top menu first to learn how to ask questions, and then the informative text of the "c" tag (just click on it). Then see http://stackoverflow.com/help/mcve. – Jens Gustedt Jul 27 '16 at 07:31
  • @JensGustedt I typo'd when tagging it, it should be c++. But here you go anyway. – ZirconiumX Jul 27 '16 at 07:54
  • Variable, element-granular shifts are part of AVX2, not SSE2. Also, they only exist for 32/64-bit element sizes. You can implement element-granular left-shift by multiplication instead. – EOF Jul 27 '16 at 07:55
  • @EOF My elements are 64-bit, though. And the element-granular left-shift doesn't help with right-shift, which I also need. – ZirconiumX Jul 27 '16 at 07:57
  • Looks like you're out of luck then. You'll probably have to move the vector elements to the integer registers, shift them and move them back, or split them into separate vector-register and merge them back after the shift. – EOF Jul 27 '16 at 08:03
  • Possible duplicate of [Shifting 4 integers right by different values SIMD](http://stackoverflow.com/questions/38363423/shifting-4-integers-right-by-different-values-simd) – Peter Cordes Jul 27 '16 at 08:08
  • @EOF: not quite: you can shift left by variable amounts with a multiply, then shift right by a constant. But with only two 64bit elements, it's actually probably better to just shift twice and blend (with `movhps` or something) – Peter Cordes Jul 27 '16 at 08:10
  • @PeterCordes: Yes, you can variable-shift left by multiplication and shift back right by a scalar afterwards, but since there's not even a 64-bit integer multiply that's probably horribly inefficient. – EOF Jul 27 '16 at 08:12
  • @EOF: Hrm, yes. I retracted my close-as-duplicate vote, but my answer on the linked question DOES actually focus on shifting and blending. (With SSE4.1 `pblendw`) – Peter Cordes Jul 27 '16 at 08:16
  • Why is `uint64_t result[2];` a class member instead of a local in `operator[]`?!?!? And why does it return a (non-const!) reference to the private class member, instead of just returning by value? It'd be easier to just make your members public, if your class is basically just there to provide some operators. Not much harm in letting users do other stuff without overhead. Also, you could use `setr` functions if you want your constructor to take args in low-to-high order. e.g. `low = _mm_setr_epi64x(a,b)` – Peter Cordes Jul 27 '16 at 08:19
  • @PeterCordes result is a class member because reference to a local goes out of scope, causing a segfault if you use the value. My full code does also have a const operator[], but I omitted it for brevity. I did consider a return by value, and may use that, though I'm not sure whether compilers will treat it differently. I didn't know about setr, I'll use those instead. – ZirconiumX Jul 27 '16 at 09:00
  • @ZirconiumX: Returning a `uint64_t` is obviously the right thing to do in that case. Of course it will be treated differently from returning a reference to a local, and that's what you want. It will be like any other function that returns an `int` or whatever. The vector store and scalar load happen inside the function (except in reality it will be inlined), instead of just returning a pointer to the memory you stored into. – Peter Cordes Jul 27 '16 at 09:09
  • @EOF: I feel like there should be a more efficient SSE2 way to combine the low 64 of one register with the high 64 of another, with fewer than three shuffle instructions for the whole function if we want to avoid FP instructions, but I'm not having any ideas for doing something differently. It's pretty easy to bottleneck on shuffle throughput if you're not careful, so even some alternative that was still more instructions than `pblendw` but didn't need port5 would be ok. – Peter Cordes Jul 27 '16 at 09:41
  • @PeterCordes: There is a sane way to combine vectors: `pinsr[b/w/d/q]`, but the `word` variant is `ssse3`, and the others are `sse4.1`. I don't see why you're against using the FP-instructions `movhlps` and `movlhps`, which are `SSE` and will probably be reasonably fast even if you add a FP->int bypass delay. – EOF Jul 27 '16 at 10:18
  • @EOF: The pins* instructions don't have two-vector forms. Maybe you're thinking of SSE4 `insertps`, which does? Also [`pinsrw` is SSE2](http://www.felixcloutier.com/x86/PINSRW.html). For this case the reg-reg form of `movsd` is more useful than `movhlps`. In a lot of cases it's probably fine, and better than two integer shuffles, but remember that you pay the bypass delay twice if it's between two integer instructions on the critical path. So even a 1c delay is worse than two integer shuffles. And Nehalem has a 2c delay. In a program with an AVX path and a fallback, this only runs there – Peter Cordes Jul 27 '16 at 10:40

1 Answer

If AVX2 is available and your elements are 32 or 64 bits, your operation takes one variable-shift instruction: `vpsrlvq` (`__m128i _mm_srlv_epi64(__m128i a, __m128i count)`).


For 32-bit elements with SSE4.1, see Shifting 4 integers right by different values SIMD. Depending on latency vs. throughput requirements, you can do separate shifts and then blend, or use a multiply (by a specially-constructed vector of powers of 2) to get variable-count left shifts and then do a same-count-for-all-elements right shift.


For your case, 64bit elements with runtime-variable shift counts:

There are only two elements per SSE vector, so we just need two shifts and then a combine of the results, which we can do with a pblendw, with a floating-point movsd (which may cause extra bypass-delay latency on some CPUs), with two shuffles, or with two ANDs and an OR.

__m128i SSE2_emulated_srlv_epi64(__m128i a, __m128i count)
{
    __m128i shift_low = _mm_srl_epi64(a, count);          // high 64 is garbage
    __m128i count_high = _mm_unpackhi_epi64(count,count); // broadcast the high element
    __m128i shift_high = _mm_srl_epi64(a, count_high);    // low 64 is garbage
    // SSE4.1:
    // return _mm_blend_epi16(shift_low, shift_high, 0x0F);

#if 1   // use movsd to blend
    __m128d blended = _mm_move_sd( _mm_castsi128_pd(shift_high), _mm_castsi128_pd(shift_low) );  // use movsd as a blend.  Faster than multiple instructions on most CPUs, but probably bad on Nehalem.
    return _mm_castpd_si128(blended);
#else  // SSE2 without using FP instructions:
    // if we're going to do it this way, we could have shuffled the input before shifting.  Probably not helpful though.
    shift_high = _mm_unpackhi_epi64(shift_high, shift_high);       // broadcast the high64
    return       _mm_unpacklo_epi64(shift_high, shift_low);        // combine
#endif
}
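The question's `operator<<=` needs left shifts; the same pattern works with `_mm_sll_epi64` (this variant is my addition, not part of the original answer):

```cpp
#include <emmintrin.h>  // SSE2

__m128i SSE2_emulated_sllv_epi64(__m128i a, __m128i count)
{
    __m128i shift_low  = _mm_sll_epi64(a, count);           // high 64 is garbage
    __m128i count_high = _mm_unpackhi_epi64(count, count);  // broadcast the high count
    __m128i shift_high = _mm_sll_epi64(a, count_high);      // low 64 is garbage
    __m128d blended = _mm_move_sd(_mm_castsi128_pd(shift_high),
                                  _mm_castsi128_pd(shift_low));  // movsd as a blend
    return _mm_castpd_si128(blended);
}
```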

Other shuffles like pshufd or psrldq would work, but punpckhqdq gets the job done without needing an immediate byte, so it's one byte shorter. SSSE3 palignr could get the high element from one register and the low element from another register into one vector, but they'd be reversed (so we'd need a pshufd to swap high and low halves). shufpd would work to blend, but has no advantage over movsd.

See Agner Fog's microarch guide for the details of the potential bypass-delay latency from using an FP instruction between two integer instructions. It's probably fine on Intel SnB-family CPUs, because other FP shuffles are. (And yes, movsd xmm1, xmm0 runs on the shuffle unit in port5. Use movaps or movapd for reg-reg moves even of scalars if you don't need the merging behaviour).


This compiles (on Godbolt with gcc5.3 -O3) to

    movdqa  xmm2, xmm0  # tmp97, a
    psrlq   xmm2, xmm1    # tmp97, count
    punpckhqdq      xmm1, xmm1  # tmp99, count
    psrlq   xmm0, xmm1    # tmp100, tmp99
    movsd   xmm0, xmm2    # tmp102, tmp97
    ret
Peter Cordes