_mm_shuffle_ps() equivalent for integer vectors (__m128i)?

Question

The _mm_shuffle_ps() intrinsic selects two floats from its first input for the low two elements of the output, and two floats from its second input for the high two elements.

For example:

R = _mm_shuffle_ps(L1, H1, _MM_SHUFFLE(3,2,3,2));

will result in:

R[0] = L1[2];
R[1] = L1[3];
R[2] = H1[2];
R[3] = H1[3];

Is there a similar intrinsic for integer data, something that takes two __m128i variables and a mask for interleaving?

The _mm_shuffle_epi32() intrinsic takes just one 128-bit vector instead of two.

Depends on the size of the elements. If you need 32 bit ints, just use `_mm_shuffle_ps`, this will work on ints too. — Gunther Piez, Oct 31 '12 at 08:26
so i should just typecast __m128i to __m128? let me see if that works.. — user1715122, Oct 31 '12 at 15:22

Mysticial · Accepted Answer · 2012-10-31T08:33:04.747

Nope, there is no integer equivalent to this. So you have to either emulate it, or cheat.

One method is to use _mm_shuffle_epi32() on A and B. Then mask out the desired terms and OR them back together.

That tends to be messy and has around 5 instructions. (Or 3 if you use the SSE4.1 blend instructions.)
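A minimal sketch of that shuffle-mask-OR emulation, using only SSE2. The helper name shuffle_hi_pairs_sse2 and the low_mask constant are hypothetical, introduced here for illustration, and the selector is hard-wired to the _MM_SHUFFLE(3,2,3,2) case from the question:

```c
#include <xmmintrin.h>  /* _MM_SHUFFLE */
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: returns { a[2], a[3], b[2], b[3] } using integer ops only. */
static inline __m128i shuffle_hi_pairs_sse2(__m128i a, __m128i b)
{
    /* Copy elements 2 and 3 of each input into both 64-bit halves. */
    __m128i as = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 2, 3, 2));
    __m128i bs = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 2, 3, 2));

    /* All-ones in the low 64 bits, zero in the high 64 bits. */
    const __m128i low_mask = _mm_set_epi32(0, 0, -1, -1);

    __m128i lo = _mm_and_si128(low_mask, as);     /* keep the low half of as  */
    __m128i hi = _mm_andnot_si128(low_mask, bs);  /* keep the high half of bs */
    return _mm_or_si128(lo, hi);
}
```

That is two shuffles, an AND, an ANDNOT and an OR, which is where the count of roughly 5 instructions comes from (not counting the load of the mask constant).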

Here's the SSE4.1 solution with 3 instructions:

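#include <smmintrin.h>  // SSE4.1: _mm_blend_epi16

__m128i A = _mm_set_epi32(13,12,11,10);
__m128i B = _mm_set_epi32(23,22,21,20);

// Same selector as _MM_SHUFFLE(3,2,3,2): copy elements 2 and 3 into both halves.
A = _mm_shuffle_epi32(A,2*1 + 3*4 + 2*16 + 3*64);
B = _mm_shuffle_epi32(B,2*1 + 3*4 + 2*16 + 3*64);

// 0xf0 takes the low four 16-bit words from A and the high four from B.
__m128i C = _mm_blend_epi16(A,B,0xf0);  // C = {12, 13, 22, 23}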

The method I prefer is to actually cheat and use the floating-point shuffle, like this:

__m128i Ai,Bi,Ci;
__m128  Af,Bf,Cf;

Af = _mm_castsi128_ps(Ai);
Bf = _mm_castsi128_ps(Bi);
Cf = _mm_shuffle_ps(Af,Bf,_MM_SHUFFLE(3,2,3,2));
Ci = _mm_castps_si128(Cf);

This converts the data type to floating-point so that the float shuffle can be used, then converts it back.

Note that these "conversions" are bitwise conversions (aka reinterpretations). No conversion is actually done and they don't map to any instructions. In the assembly, there is no distinction between an integer or a floating-point SSE register. These cast intrinsics are just to get around the type-safety imposed by C/C++.
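To illustrate that the casts are pure reinterpretations, here is a small sketch that wraps the cast-and-shuffle sequence in a function. The name shuffle_hi_pairs_cast is hypothetical, not part of any library:

```c
#include <xmmintrin.h>  /* _mm_shuffle_ps, _MM_SHUFFLE */
#include <emmintrin.h>  /* _mm_castsi128_ps, _mm_castps_si128 */

/* Hypothetical helper: returns { a[2], a[3], b[2], b[3] } via the float shuffle. */
static inline __m128i shuffle_hi_pairs_cast(__m128i a, __m128i b)
{
    /* Reinterpret the bits as floats; no value conversion takes place. */
    __m128 af = _mm_castsi128_ps(a);
    __m128 bf = _mm_castsi128_ps(b);

    /* Low two outputs come from af, high two from bf. */
    __m128 cf = _mm_shuffle_ps(af, bf, _MM_SHUFFLE(3, 2, 3, 2));

    /* Reinterpret the result as integers again. */
    return _mm_castps_si128(cf);
}
```

Because the shuffle only moves bits around, integer values whose bit patterns would be NaNs or denormals when viewed as floats should come out unchanged.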

However, be aware that on some CPUs this approach can incur extra latency for moving data back and forth between the integer and floating-point SIMD execution domains. Where that applies, it will be more expensive than the shuffle instruction alone.

edited Oct 31 '12 at 08:33

answered Oct 31 '12 at 08:18

That's pretty much what I was about to post, but it took me longer. – harold Oct 31 '12 at 08:26
Try `-flax-vector-conversions` – Gunther Piez Oct 31 '12 at 08:28
I wonder how this compares with not switching domains and doing `_mm_shuffle_epi32(_mm_unpackhi_epi32(Ai,Bi), 0xd8)`? – Z boson Nov 18 '14 at 14:42
@Zboson I never actually tried that. I can't say I actually need such a shuffle for integers anymore - since I've always been able to find a better data layout that had other benefits. – Mysticial Nov 18 '14 at 18:57
@Mysticial, yeah, I understand that. But in any case, Agner also says the delay is a latency delay and only matters when latency is an issue, not throughput. I just realized this. – Z boson Nov 19 '14 at 09:12
@Zboson: There is no extra bypass delay for using FP shuffles on integer data. (On some CPUs, the reverse is not true. On other CPUs, e.g. AMD, even FP shuffles happen in the ivec domain and impose a bypass delay for `addps` / `shufps` / `addps`.) The same shuffle hardware handles FP and int shuffling; it's just a matter of wiring. Apparently it's possible for HW designers to still make the result of FP shuffles available on the integer forwarding network as well as the FP forwarding network. – Peter Cordes Feb 06 '16 at 09:58
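
For reference, the integer-only alternative raised in the comments above can be sketched as follows. The function name shuffle_hi_pairs_unpack is hypothetical, and this only covers the specific _MM_SHUFFLE(3,2,3,2) selection from the question, not arbitrary masks:

```c
#include <xmmintrin.h>  /* _MM_SHUFFLE */
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: returns { a[2], a[3], b[2], b[3] } without leaving the integer domain. */
static inline __m128i shuffle_hi_pairs_unpack(__m128i a, __m128i b)
{
    /* Interleave the high halves: { a[2], b[2], a[3], b[3] }. */
    __m128i t = _mm_unpackhi_epi32(a, b);

    /* 0xd8 == _MM_SHUFFLE(3, 1, 2, 0): swap the two middle elements. */
    return _mm_shuffle_epi32(t, 0xd8);
}
```

This is two integer instructions; whether it beats the float shuffle on a given CPU is not something the thread settles.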
