
The SSE intrinsics include `_mm_shuffle_ps(xmm1, xmm2, imm)`, which lets you pick 2 elements from xmm1 concatenated with 2 elements from xmm2. However, this is for floats (implied by the `_ps`, packed single). But if you cast your packed integers (`__m128i`), then you can use `_mm_shuffle_ps` as well:

#include <iostream>
#include <immintrin.h>
#include <sstream>

using namespace std;

template <typename T>
std::string __m128i_toString(const __m128i var) {
    std::stringstream sstr;
    const T* values = (const T*) &var;
    if (sizeof(T) == 1) {
        for (unsigned int i = 0; i < sizeof(__m128i); i++) {
            sstr << (int) values[i] << " ";
        }
    } else {
        for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) {
            sstr << values[i] << " ";
        }
    }
    return sstr.str();
}



int main(){

  cout << "Starting SSE test" << endl;
  cout << "integer shuffle" << endl;

  alignas(16) int A[] = {1,  -2147483648, 3, 5};
  alignas(16) int B[] = {4, 6, 7, 8};

  __m128i pC;

  __m128i* pA = (__m128i*) A;
  __m128i* pB = (__m128i*) B;

  *pA = (__m128i)_mm_shuffle_ps((__m128)*pA, (__m128)*pB, _MM_SHUFFLE(3, 2, 1 ,0));
  pC = _mm_add_epi32(*pA,*pB);

  cout << "A[0] = " << A[0] << endl;
  cout << "A[1] = " << A[1] << endl;
  cout << "A[2] = " << A[2] << endl;
  cout << "A[3] = " << A[3] << endl;

  cout << "B[0] = " << B[0] << endl;
  cout << "B[1] = " << B[1] << endl;
  cout << "B[2] = " << B[2] << endl;
  cout << "B[3] = " << B[3] << endl;

  cout << "pA = " << __m128i_toString<int>(*pA) << endl;
  cout << "pC = " << __m128i_toString<int>(pC) << endl;
}

Snippet of the relevant corresponding assembly (Mac OS X, MacPorts GCC 4.8, `-march=native` on an Ivy Bridge CPU):

vshufps $228, 16(%rsp), %xmm1, %xmm0
vpaddd  16(%rsp), %xmm0, %xmm2
vmovdqa %xmm0, 32(%rsp)
vmovaps %xmm0, (%rsp)
vmovdqa %xmm2, 16(%rsp)
call    __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
....

Thus it seemingly works fine on integers, which I expected, since the registers are agnostic to types. However, there must be a reason the docs say this instruction is only for floats. Does anyone know of downsides or implications I have missed?

hbogert
  • Accessing the SSE/AVX-registers with an incompatible type *may* hurt performance. (Only on the newest intel processors AFAIK) – EOF Nov 17 '14 at 23:01
  • See the comments in this question: [difference-between-the-avx-instructions-vxorpd-and-vpxor](https://stackoverflow.com/questions/26942952/difference-between-the-avx-instructions-vxorpd-and-vpxor). Particularly the first one, by Mysticial – Z boson Nov 18 '14 at 08:32
  • [mm-shuffle-ps-equivalent-for-integer-vectors-m128i](https://stackoverflow.com/questions/13153584/mm-shuffle-ps-equivalent-for-integer-vectors-m128i). – Z boson Nov 18 '14 at 13:44
  • I was aware of that similar question, though there the question was to have a functional equivalent, my question uses the shuffle instruction as example. – hbogert Nov 18 '14 at 15:12

1 Answer


There is no integer equivalent to `_mm_shuffle_ps`. To achieve the same effect in this case, you can do:

SSE2

*pA = _mm_shuffle_epi32(_mm_unpacklo_epi32(*pA, _mm_shuffle_epi32(*pB, 0xe)),0xd8);

SSE4.1

*pA = _mm_blend_epi16(*pA, *pB, 0xf0);

or change to the floating point domain like this

*pA = _mm_castps_si128( 
        _mm_shuffle_ps(_mm_castsi128_ps(*pA), 
                       _mm_castsi128_ps(*pB), _MM_SHUFFLE(3, 2, 1 ,0)));

But changing domains may incur bypass latency delays on some CPUs. Keep in mind that, according to Agner Fog:

The bypass delay is important in long dependency chains where latency is a bottleneck, but not where it is throughput rather than latency that matters.

You have to test your code and see which method above is more efficient.

Fortunately, on most Intel/AMD CPUs, there is usually no penalty for using shufps between most integer-vector instructions. Agner says:

For example, I found no delay when mixing PADDD and SHUFPS [on Sandybridge].

Nehalem does have 2 cycles of bypass-delay latency to/from SHUFPS, but even then a single SHUFPS is often still faster than multiple other instructions. Extra instructions cost latency, too, as well as throughput.


The reverse (integer shuffles between FP math instructions) is not as safe:

In Agner Fog's microarchitecture guide (Example 8.3a, page 112), he shows that using PSHUFD (`_mm_shuffle_epi32`) instead of SHUFPS (`_mm_shuffle_ps`) while in the floating-point domain causes a bypass delay of four clock cycles. In Example 8.3b he uses SHUFPS to remove the delay (which works in his example).

On Nehalem there are actually five domains. Nehalem seems to be the most affected (the bypass delays did not exist before Nehalem). On Sandy Bridge the delays are less significant, and even less so on Haswell. In fact, on Haswell Agner says he found no delays between SHUFPS and PSHUFD (see page 140).

Peter Cordes
Z boson
  • I've updated my example code and added a snippet of the produced assembly. So, if you have e.g. `vshufps $228, 16(%rsp), %xmm1, %xmm0; vpaddd 16(%rsp), %xmm0, %xmm2`, where vshufps is in the FLOAT domain and vpaddd is in the INT domain, then internally the CPU will do bypasses, at least for xmm0 in this specific case? – hbogert Nov 18 '14 at 12:38
  • @hbogert, sorry, I did not pay enough attention. `_mm_shuffle_ps` and `_mm_shuffle_epi32` are not equivalent. I updated my answer. – Z boson Nov 18 '14 at 14:16
  • @hbogert, you can do this with SSE4.1 using `*pA = _mm_blend_epi16(*pA, *pB, 0xf0);` – Z boson Nov 18 '14 at 14:36
  • can you answer my comment 3 items above? Is that an example of where delays would occur? If that is correct, my question (somewhat vaguely I admit ) is answered. – hbogert Nov 18 '14 at 15:16
  • 1
    @hbogert, yes, as far as I understand this is a case which could cause a bypass delay. That's why I showed different ways of doing this. You could try `_mm_blend_epi16` which would not cause a delay as I suggested to see if it makes a difference. But your loop probably needs to be highly optimized to notice a difference. – Z boson Nov 19 '14 at 08:37
  • @hbogert, the bypass delay is also a latency delay so if your loop is not latency bound and is instead throughput bound then you probably won't notice a delay anyway. In general I try and make my loops throughput bound anyway. – Z boson Nov 19 '14 at 09:16