
The _mm_shuffle_ps() intrinsic selects two floats from its first operand for the low two floats of the output and two floats from its second operand for the high two floats.

For example:

R = _mm_shuffle_ps(L1, H1, _MM_SHUFFLE(3,2,3,2));

will result in:

R[0] = L1[2];
R[1] = L1[3];
R[2] = H1[2];
R[3] = H1[3];

Is there a similar intrinsic for the integer data type, one that takes two __m128i variables and a mask for interleaving?

The _mm_shuffle_epi32() intrinsic takes just one 128-bit vector instead of two.

Mysticial
user1715122

1 Answer


Nope, there is no integer equivalent to this. So you have to either emulate it, or cheat.

One method is to apply _mm_shuffle_epi32() to A and B separately, mask each result to keep only the desired elements, and OR the two results together.

That tends to be messy and takes around 5 instructions (or 3 if you use the SSE4.1 blend instructions).
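
The mask-and-OR emulation described above might look like the following sketch. It assumes plain SSE2 and the same _MM_SHUFFLE(3,2,3,2) selection used in the question; the helper name shuffle_hi_pairs_sse2 and the constant low_mask are made up for illustration and do not come from any library.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics; also provides _MM_SHUFFLE */

/* Hypothetical helper: result = { a[2], a[3], b[2], b[3] }, lowest element first. */
static inline __m128i shuffle_hi_pairs_sse2(__m128i a, __m128i b)
{
    /* All-ones in elements 0 and 1, zero in elements 2 and 3. */
    const __m128i low_mask = _mm_set_epi32(0, 0, -1, -1);

    /* Move elements 2 and 3 of each input into both halves. */
    __m128i sa = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 2, 3, 2));
    __m128i sb = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 2, 3, 2));

    sa = _mm_and_si128(low_mask, sa);     /* keep the low half of sa   */
    sb = _mm_andnot_si128(low_mask, sb);  /* keep the high half of sb  */
    return _mm_or_si128(sa, sb);          /* merge the two halves      */
}
```

Not counting the load of the mask constant, that is two shuffles, an AND, an ANDNOT, and an OR, which is where the count of about 5 instructions comes from.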

Here's the SSE4.1 solution with 3 instructions:

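// Requires SSE4.1 (<smmintrin.h>).
__m128i A = _mm_set_epi32(13,12,11,10);   // A = {10,11,12,13}, lowest element first
__m128i B = _mm_set_epi32(23,22,21,20);   // B = {20,21,22,23}

// _MM_SHUFFLE(3,2,3,2) == 2*1 + 3*4 + 2*16 + 3*64: each result is {x[2],x[3],x[2],x[3]}.
A = _mm_shuffle_epi32(A,_MM_SHUFFLE(3,2,3,2));
B = _mm_shuffle_epi32(B,_MM_SHUFFLE(3,2,3,2));

// 0xf0 takes the upper four 16-bit words (elements 2 and 3) from B, the rest from A.
__m128i C = _mm_blend_epi16(A,B,0xf0);    // C = {12,13,22,23}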

The method that I prefer is to actually cheat and use the floating-point shuffle, like this:

__m128i Ai,Bi,Ci;
__m128  Af,Bf,Cf;

Af = _mm_castsi128_ps(Ai);
Bf = _mm_castsi128_ps(Bi);
Cf = _mm_shuffle_ps(Af,Bf,_MM_SHUFFLE(3,2,3,2));
Ci = _mm_castps_si128(Cf);

This converts the datatype to floating-point so that the float shuffle can be used, then converts it back.

Note that these "conversions" are bitwise conversions (aka reinterpretations). No conversion is actually done and they don't map to any instructions. In the assembly, there is no distinction between an integer and a floating-point SSE register. These cast intrinsics are just to get around the type-safety imposed by C/C++.
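
To illustrate that the casts only reinterpret bits, here is a small sketch that wraps the cast-and-shuffle sequence in a function. The name shuffle_hi_pairs_ps is hypothetical, and the selection is fixed to _MM_SHUFFLE(3,2,3,2) as in the question.

```c
#include <emmintrin.h>  /* SSE2: __m128i and the cast intrinsics; also pulls in _mm_shuffle_ps */

/* Hypothetical helper: result = { a[2], a[3], b[2], b[3] }, lowest element first. */
static inline __m128i shuffle_hi_pairs_ps(__m128i a, __m128i b)
{
    __m128 af = _mm_castsi128_ps(a);  /* reinterpret only, no value conversion */
    __m128 bf = _mm_castsi128_ps(b);
    __m128 cf = _mm_shuffle_ps(af, bf, _MM_SHUFFLE(3, 2, 3, 2));
    return _mm_castps_si128(cf);      /* reinterpret back to integer type */
}
```

Because the shuffle only moves bits, integer values whose bit patterns would be NaN or infinity as floats (for example 0x7FC00001 or -1) come out unchanged.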

However, be aware that on some processors this approach can incur extra latency for moving data back and forth between the integer and floating-point SIMD execution units, so it may be more expensive than the shuffle instruction alone.

Mysticial
  • That's pretty much what I was about to post, but it took me longer. – harold Oct 31 '12 at 08:26
  • Try `-flax-vector-conversions` – Gunther Piez Oct 31 '12 at 08:28
  • I wonder how this compares with not switching domains and doing `_mm_shuffle_epi32(_mm_unpackhi_epi32(Ai,Bi), 0xd8)`? – Z boson Nov 18 '14 at 14:42
  • @Zboson I never actually tried that. I can't say I actually need such a shuffle for integers anymore - since I've always been able to find a better data layout that had other benefits. – Mysticial Nov 18 '14 at 18:57
  • @Mysticial, yeah, I understand that. But in any case, Agner also says the delay is a latency delay and only matters when latency is an issue, not throughput. I just realized this. – Z boson Nov 19 '14 at 09:12
  • @Zboson: There is no extra bypass delay for using FP shuffles on integer data. (On some CPUs, the reverse is not true. On other CPUs, e.g. AMD, even FP shuffles happen in the ivec domain and impose a bypass delay for `addps` / `shufps` / `addps`.) The same shuffle hardware handles FP and int shuffling; it's just a matter of wiring. Apparently it's possible for HW designers to still make the result of FP shuffles available on the integer forwarding network as well as the FP forwarding network. – Peter Cordes Feb 06 '16 at 09:58
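
The integer-only alternative raised in the comments can be sketched as follows. This is an illustration under the same _MM_SHUFFLE(3,2,3,2) assumption as the question, not a benchmarked recommendation; both helper names are hypothetical. The second helper relies on the fact that, for this particular selection, the result is simply the high 64 bits of each input, which _mm_unpackhi_epi64() produces directly. Other selections would not reduce this way.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper following the comment: unpack, then reorder. */
static inline __m128i shuffle_hi_pairs_unpack32(__m128i a, __m128i b)
{
    __m128i t = _mm_unpackhi_epi32(a, b);  /* { a[2], b[2], a[3], b[3] } */
    return _mm_shuffle_epi32(t, 0xd8);     /* pick 0,2,1,3 -> { a[2], a[3], b[2], b[3] } */
}

/* Hypothetical helper: for this specific selection, one 64-bit unpack is enough. */
static inline __m128i shuffle_hi_pairs_unpack64(__m128i a, __m128i b)
{
    return _mm_unpackhi_epi64(a, b);       /* { a[2], a[3], b[2], b[3] } */
}
```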