Permuting bytes inside SSE __m128i register

Question

I have the following problem:

An __m128i register holds sixteen 8-bit values in the following order:

[1, 5, 9, 13] [2, 6, 10, 14] [3, 7, 11, 15] [4, 8, 12, 16]

What I would like to do is shuffle the bytes efficiently to get this order:

[1, 2, 3, 4] [5, 6, 7, 8] [9, 10, 11, 12] [13, 14, 15, 16]

This is analogous to a 4x4 matrix transposition, but operating on 8-bit elements inside a single register.

Could you please point me to the SSE instructions (preferably <= SSE2) that are suitable for this?

Well with SSSE3 there's `pshufb`, otherwise it's going to get a little messy — harold, Jul 06 '14 at 10:45
Thanks for pointing me to pshufb ! I think I can move from there :) — born49, Jul 06 '14 at 10:54
Aside from SSSE3's `PSHUFB`, which is sort of an Intel exclusive on older CPUs, you're looking at a messy sequence of SSE2 `PUNPCK(L/H)BW` instructions, as @harold pointed out. — Iwillnotexist Idonotexist, Jul 06 '14 at 16:59

Apriori · Accepted Answer · 2014-07-08T06:16:41.740

You will really want to use SSSE3 for this; it is much cleaner than trying to stay at SSE2 or below.

Your code will look something like this:

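   #include <emmintrin.h> // SSE2:  _mm_set_epi8
   #include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8
   ...
   // check if your hardware supports SSSE3
   ...
   // _mm_set_epi8 lists bytes from byte 15 down to byte 0;
   // mask byte i is the index of the source byte that lands in output byte i
   __m128i mask = _mm_set_epi8(15, 11, 7, 3,
                               14, 10, 6, 2,
                               13,  9, 5, 1,
                               12,  8, 4, 0);
   // the question's layout: [1, 5, 9, 13] [2, 6, 10, 14] [3, 7, 11, 15] [4, 8, 12, 16]
   __m128i mtrx = _mm_set_epi8(16, 12, 8, 4,
                               15, 11, 7, 3,
                               14, 10, 6, 2,
                               13,  9, 5, 1);
   mtrx         = _mm_shuffle_epi8(mtrx, mask); // [1, 2, 3, 4] [5, 6, 7, 8] [9, 10, 11, 12] [13, 14, 15, 16]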
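
If the meaning of the mask bytes is unclear, a plain-C model of the byte shuffle may help. The sketch below mimics what `_mm_shuffle_epi8` does to 16 bytes: output byte i is the source byte selected by the low four bits of mask byte i, or zero when the mask byte has its high bit set. The names `shuffle_bytes_model` and `shuffle_model_transposes` are made up for this illustration, and the arrays are written in memory order (byte 0 first), which is the reverse of the argument order of `_mm_set_epi8`.

```c
#include <stdint.h>

/* Scalar model of the pshufb byte shuffle (hypothetical helper, for
   illustration only; it is not a library function). */
static void shuffle_bytes_model(const uint8_t src[16], const uint8_t mask[16],
                                uint8_t dst[16])
{
    for (int i = 0; i < 16; ++i) {
        /* High bit set in the mask byte: output byte is zero.
           Otherwise the low 4 bits index into the source. */
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    }
}

/* Applies the answer's mask to the question's layout.
   Returns 1 if the result is 1..16 in memory order, 0 otherwise. */
static int shuffle_model_transposes(void)
{
    /* Memory order: byte 0 first. */
    const uint8_t src[16]  = { 1, 5,  9, 13,  2, 6, 10, 14,
                               3, 7, 11, 15,  4, 8, 12, 16 };
    const uint8_t mask[16] = { 0, 4,  8, 12,  1, 5,  9, 13,
                               2, 6, 10, 14,  3, 7, 11, 15 };
    uint8_t dst[16];

    shuffle_bytes_model(src, mask, dst);
    for (int i = 0; i < 16; ++i)
        if (dst[i] != i + 1)
            return 0;
    return 1;
}
```

Reading the mask this way, entry 1 is 4 because the second output byte (value 2) sits at byte index 4 of the input.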

If you really want SSE2 this will suffice:
(assuming I'm interpreting your initial ordering correctly)

  // 0x00FF in every 16-bit lane: keeps the low byte, clears the high byte
  __m128i mask = _mm_set_epi8(0x00, 0xFF, 0x00, 0xFF,
                              0x00, 0xFF, 0x00, 0xFF,
                              0x00, 0xFF, 0x00, 0xFF,
                              0x00, 0xFF, 0x00, 0xFF);
  __m128i mtrx = _mm_set_epi8(16, 12, 8, 4,
                              15, 11, 7, 3,
                              14, 10, 6, 2,
                              13,  9, 5, 1);                                   // [1, 5, 9, 13] [2,  6, 10, 14] [3,  7, 11, 15] [ 4,  8, 12, 16]
  mtrx = _mm_packus_epi16(_mm_and_si128(mtrx, mask), _mm_srli_epi16(mtrx, 8)); // [1, 9, 2, 10] [3, 11,  4, 12] [5, 13,  6, 14] [ 7, 15,  8, 16]
  mtrx = _mm_packus_epi16(_mm_and_si128(mtrx, mask), _mm_srli_epi16(mtrx, 8)); // [1, 2, 3,  4] [5,  6,  7,  8] [9, 10, 11, 12] [13, 14, 15, 16]
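
The AND with the mask matters because `_mm_packus_epi16` saturates rather than truncates. A small sketch, assuming an SSE2-capable x86 target, shows the difference for one lane taken from the question's data (low byte 1, high byte 5). The helper name `pack_lane_to_byte` is hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Broadcasts one 16-bit lane, optionally clears its high byte, packs it
   with unsigned saturation, and returns the first packed byte.
   Hypothetical helper, for illustration only. */
static uint8_t pack_lane_to_byte(uint16_t lane, int mask_first)
{
    __m128i v = _mm_set1_epi16((short)lane);
    if (mask_first)
        v = _mm_and_si128(v, _mm_set1_epi16(0x00FF)); /* clear the high byte */
    __m128i packed = _mm_packus_epi16(v, v);          /* saturating pack */
    return (uint8_t)(_mm_cvtsi128_si32(packed) & 0xFF);
}
```

With the lane 0x0501, the unmasked call clamps to 0xFF because 0x0501 is larger than 255, while the masked call returns the intended low byte 0x01.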

Or more easily debuggable:

  __m128i mtrx = _mm_set_epi8(16, 12, 8, 4,
                              15, 11, 7, 3,
                              14, 10, 6, 2,
                              13, 9, 5, 1);            // [1, 5,  9, 13] [ 2,  6, 10, 14] [ 3,  7, 11, 15] [ 4,  8, 12, 16]
  __m128i mask = _mm_set_epi8(0x00, 0xFF, 0x00, 0xFF,
                              0x00, 0xFF, 0x00, 0xFF,
                              0x00, 0xFF, 0x00, 0xFF,
                              0x00, 0xFF, 0x00, 0xFF);
  __m128i temp = _mm_srli_epi16(mtrx, 8);              // [5, 0, 13,  0] [ 6,  0, 14,  0] [ 7,  0, 15,  0] [ 8,  0, 16,  0]
  mtrx         = _mm_and_si128(mtrx, mask);            // [1, 0,  9,  0] [ 2,  0, 10,  0] [ 3,  0, 11,  0] [ 4,  0, 12,  0]
  mtrx         = _mm_packus_epi16(mtrx, temp);         // [1, 9,  2, 10] [ 3, 11,  4, 12] [ 5, 13,  6, 14] [ 7, 15,  8, 16]
  temp         = _mm_srli_epi16(mtrx, 8);              // [9, 0, 10,  0] [11,  0, 12,  0] [13,  0, 14,  0] [15,  0, 16,  0]
  mtrx         = _mm_and_si128(mtrx, mask);            // [1, 0,  2,  0] [ 3,  0,  4,  0] [ 5,  0,  6,  0] [ 7,  0,  8,  0] 
  mtrx         = _mm_packus_epi16(mtrx, temp);         // [1, 2,  3,  4] [ 5,  6,  7,  8] [ 9, 10, 11, 12] [13, 14, 15, 16]
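
For comparison, here is a sketch of a different SSE2 route, built on the byte-unpack instructions (`PUNPCKLBW`) mentioned in the comments on the question rather than on mask, shift and pack. It is not the code from this answer. It relies on the observation that interleaving the low and high halves of the register twice amounts to a 4x4 byte transpose. The function names are hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Alternative SSE2 sketch: two byte interleaves.
   Comments show bytes in memory order for the question's input. */
static __m128i transpose4x4_u8_unpack(__m128i x)
{
    /* x  = [1, 5, 9, 13] [2, 6, 10, 14] [3, 7, 11, 15] [4, 8, 12, 16] */
    __m128i hi = _mm_srli_si128(x, 8); /* upper 8 bytes moved to the bottom */
    x  = _mm_unpacklo_epi8(x, hi);     /* [1, 3, 5, 7] [9, 11, 13, 15] [2, 4, 6, 8] [10, 12, 14, 16] */
    hi = _mm_srli_si128(x, 8);
    return _mm_unpacklo_epi8(x, hi);   /* [1, 2, 3, 4] [5, 6, 7, 8] [9, 10, 11, 12] [13, 14, 15, 16] */
}

/* Runs the sketch on the question's layout.
   Returns 1 if the bytes come out as 1..16 in memory order, 0 otherwise. */
static int unpack_transpose_ok(void)
{
    const uint8_t in[16] = { 1, 5,  9, 13,  2, 6, 10, 14,
                             3, 7, 11, 15,  4, 8, 12, 16 };
    uint8_t out[16];

    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, transpose4x4_u8_unpack(v));
    for (int i = 0; i < 16; ++i)
        if (out[i] != i + 1)
            return 0;
    return 1;
}
```

This variant needs no mask constant. Whether it is faster than the pack-based version depends on the CPU, so it is worth measuring both.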

Many thanks for your explanation! I will use SSSE3 as the main path and fall back to the SSE2 version on platforms without SSSE3. — born49, Jul 07 '14 at 08:18
@user3809354: No problem! Keep in mind you'll want to call CPUID once at start-up and store whatever you need to determine which version of the code to run. [MSDN](http://msdn.microsoft.com/en-us/library/y0dh78ez(vs.80).aspx) is a pretty good starting point for seeing what's there. You can navigate up the tree to see SSSE3 and SSE4 intrinsics. Some things on there I've found could really use examples (e.g. shuffle mask value meanings), but that's what SO is for. :) — Apriori, Jul 07 '14 at 20:52
@user3809354: btw, I realized you can save a couple of instructions in the SSE2 version by calling _mm_srli_epi16 instead of _mm_srli_si128, because it will shift in zeros for each 16-bit component. This avoids a later mask because zeros need to be in the high byte of each 16-bit component before calling pack. This is because there is not a version of pack that truncates (i.e. _mm_packs_epi16 and _mm_packus_epi16 perform signed and unsigned saturate respectively) so to throw away the upper byte, it needs to contain zeros. I've updated the answer's code to reflect this. — Apriori, Jul 08 '14 at 06:25
Thanks again! I have managed to get it running because of your help! On another note, I must say that the "orthogonality" of the SSE / AVX instruction sets made me laugh several times :) — born49, Jul 08 '14 at 19:54
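
The start-up check discussed in the comments could look roughly like the sketch below. It assumes GCC or Clang on x86: `__builtin_cpu_supports` and the `target` function attribute are compiler extensions, not standard C, and other compilers would need their own CPUID wrapper. The names `transpose_fn`, `transpose_ssse3`, `transpose_sse2`, `select_transpose` and `dispatch_transpose_ok` are hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */
#include <tmmintrin.h> /* SSSE3 */

typedef __m128i (*transpose_fn)(__m128i);

/* SSSE3 path; the attribute lets this one function use SSSE3 even if the
   rest of the file is compiled for plain SSE2 (GCC/Clang extension). */
__attribute__((target("ssse3")))
static __m128i transpose_ssse3(__m128i x)
{
    const __m128i mask = _mm_set_epi8(15, 11, 7, 3, 14, 10, 6, 2,
                                      13,  9, 5, 1, 12,  8, 4, 0);
    return _mm_shuffle_epi8(x, mask);
}

/* SSE2 fallback: the mask/shift/pack sequence from the answer. */
static __m128i transpose_sse2(__m128i x)
{
    const __m128i mask = _mm_set1_epi16(0x00FF);
    x = _mm_packus_epi16(_mm_and_si128(x, mask), _mm_srli_epi16(x, 8));
    return _mm_packus_epi16(_mm_and_si128(x, mask), _mm_srli_epi16(x, 8));
}

/* Query the CPU once and hand back the matching implementation. */
static transpose_fn select_transpose(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("ssse3") ? transpose_ssse3 : transpose_sse2;
}

/* Returns 1 if both the selected path and the SSE2 fallback turn the
   question's layout into 1..16 in memory order, 0 otherwise. */
static int dispatch_transpose_ok(void)
{
    const uint8_t in[16] = { 1, 5,  9, 13,  2, 6, 10, 14,
                             3, 7, 11, 15,  4, 8, 12, 16 };
    uint8_t a[16], b[16];

    __m128i v = _mm_loadu_si128((const __m128i *)in);
    transpose_fn fn = select_transpose(); /* in real code: once at start-up */
    _mm_storeu_si128((__m128i *)a, fn(v));
    _mm_storeu_si128((__m128i *)b, transpose_sse2(v));
    for (int i = 0; i < 16; ++i)
        if (a[i] != i + 1 || b[i] != i + 1)
            return 0;
    return 1;
}
```

In a real program the pointer returned by `select_transpose` would be stored once, for example in a static variable, and reused for every call, so the CPU query is not repeated in the hot path.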
