
The SSE intrinsics include `_mm_shuffle_ps(xmm1, xmm2, imm)`, which lets you pick 2 elements from xmm1 concatenated with 2 elements from xmm2. However, this is for floats (implied by the `_ps`, packed single). But if you cast your packed integers (`__m128i`), then you can use `_mm_shuffle_ps` as well:

#include <iostream>
#include <immintrin.h>
#include <sstream>

using namespace std;

template <typename T>
std::string __m128i_toString(const __m128i var) {
    std::stringstream sstr;
    const T* values = (const T*) &var;
    if (sizeof(T) == 1) {
        for (unsigned int i = 0; i < sizeof(__m128i); i++) {
            sstr << (int) values[i] << " ";
        }
    } else {
        for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) {
            sstr << values[i] << " ";
        }
    }
    return sstr.str();
}



int main(){

  cout << "Starting SSE test" << endl;
  cout << "integer shuffle" << endl;

  alignas(16) int A[] = {1,  -2147483648, 3, 5};
  alignas(16) int B[] = {4, 6, 7, 8};

  __m128i pC;

  __m128i* pA = (__m128i*) A;
  __m128i* pB = (__m128i*) B;

  *pA = (__m128i)_mm_shuffle_ps((__m128)*pA, (__m128)*pB, _MM_SHUFFLE(3, 2, 1 ,0));
  pC = _mm_add_epi32(*pA,*pB);

  cout << "A[0] = " << A[0] << endl;
  cout << "A[1] = " << A[1] << endl;
  cout << "A[2] = " << A[2] << endl;
  cout << "A[3] = " << A[3] << endl;

  cout << "B[0] = " << B[0] << endl;
  cout << "B[1] = " << B[1] << endl;
  cout << "B[2] = " << B[2] << endl;
  cout << "B[3] = " << B[3] << endl;

  cout << "pA = " << __m128i_toString<int>(*pA) << endl;
  cout << "pC = " << __m128i_toString<int>(pC) << endl;
}

Snippet of the relevant corresponding assembly (Mac OS X, MacPorts GCC 4.8, `-march=native` on an Ivy Bridge CPU):

vshufps $228, 16(%rsp), %xmm1, %xmm0
vpaddd  16(%rsp), %xmm0, %xmm2
vmovdqa %xmm0, 32(%rsp)
vmovaps %xmm0, (%rsp)
vmovdqa %xmm2, 16(%rsp)
call    __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
....

Thus it seemingly works fine on integers, which I expected, since the registers are agnostic to types. However, there must be a reason the docs say this instruction is only for floats. Does anyone know of downsides or implications I have missed?

hbogert
  • Accessing the SSE/AVX-registers with an incompatible type *may* hurt performance. (Only on the newest intel processors AFAIK) – EOF Nov 17 '14 at 23:01
  • See the comments in this question: [difference-between-the-avx-instructions-vxorpd-and-vpxor](https://stackoverflow.com/questions/26942952/difference-between-the-avx-instructions-vxorpd-and-vpxor). Particularly the first one, by Mysticial – Z boson Nov 18 '14 at 08:32
  • [mm-shuffle-ps-equivalent-for-integer-vectors-m128i](https://stackoverflow.com/questions/13153584/mm-shuffle-ps-equivalent-for-integer-vectors-m128i). – Z boson Nov 18 '14 at 13:44
  • I was aware of that similar question, though there the question was to have a functional equivalent, my question uses the shuffle instruction as example. – hbogert Nov 18 '14 at 15:12

1 Answer


There is no integer equivalent to `_mm_shuffle_ps`. To achieve the same effect in this case, you can do:

SSE2

*pA = _mm_shuffle_epi32(_mm_unpacklo_epi32(*pA, _mm_shuffle_epi32(*pB, 0xe)),0xd8);

SSE4.1

*pA = _mm_blend_epi16(*pA, *pB, 0xf0);

or change to the floating point domain like this

*pA = _mm_castps_si128( 
        _mm_shuffle_ps(_mm_castsi128_ps(*pA), 
                       _mm_castsi128_ps(*pB), _MM_SHUFFLE(3, 2, 1 ,0)));

But changing domains may incur bypass latency delays on some CPUs. Keep in mind that, according to Agner Fog:

The bypass delay is important in long dependency chains where latency is a bottleneck, but not where it is throughput rather than latency that matters.

You have to test your code and see which method above is more efficient.

Fortunately, on most Intel/AMD CPUs, there is usually no penalty for using shufps between most integer-vector instructions. Agner says:

For example, I found no delay when mixing PADDD and SHUFPS [on Sandybridge].

Nehalem does have 2 cycles of bypass-delay latency to/from SHUFPS, but even then a single SHUFPS is often still faster than multiple other instructions. Extra instructions cost latency, too, as well as throughput.


The reverse (integer shuffles between FP math instructions) is not as safe:

In Agner Fog's microarchitecture guide (Example 8.3a, page 112), he shows that using PSHUFD (`_mm_shuffle_epi32`) instead of SHUFPS (`_mm_shuffle_ps`) while in the floating-point domain causes a bypass delay of four clock cycles. In Example 8.3b he uses SHUFPS to remove the delay (which works in his example).

On Nehalem there are actually five domains. Nehalem seems to be the most affected (the bypass delays did not exist before Nehalem). On Sandy Bridge the delays are less significant, and even less so on Haswell. In fact, on Haswell Agner says he found no delays between SHUFPS and PSHUFD (see page 140).

Peter Cordes
Z boson
  • I've updated my example code and added a snippet of the produced assembly. So, if you have e.g. `vshufps $228, 16(%rsp), %xmm1, %xmm0; vpaddd 16(%rsp), %xmm0, %xmm2`, where vshufps is in the FLOAT domain and vpaddd is in the INT domain, then internally the CPU will do bypasses, at least for xmm0 in this specific case? – hbogert Nov 18 '14 at 12:38
  • @hbogert, sorry, I did not pay enough attention. `_mm_shuffle_ps` and `_mm_shuffle_epi32` are not equivalent. I updated my answer. – Z boson Nov 18 '14 at 14:16
  • @hbogert, you can do this with SSE4.1 using `*pA = _mm_blend_epi16(*pA, *pB, 0xf0);` – Z boson Nov 18 '14 at 14:36
  • can you answer my comment 3 items above? Is that an example of where delays would occur? If that is correct, my question (somewhat vaguely I admit ) is answered. – hbogert Nov 18 '14 at 15:16
  • 1
    @hbogert, yes, as far as I understand this is a case which could cause a bypass delay. That's why I showed different ways of doing this. You could try `_mm_blend_epi16` which would not cause a delay as I suggested to see if it makes a difference. But your loop probably needs to be highly optimized to notice a difference. – Z boson Nov 19 '14 at 08:37
  • @hbogert, the bypass delay is also a latency delay so if your loop is not latency bound and is instead throughput bound then you probably won't notice a delay anyway. In general I try and make my loops throughput bound anyway. – Z boson Nov 19 '14 at 09:16