Principle of interleave shuffle with SSE

Question

Target:

For an ordered input list:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

produce its interleaved shuffle:

1 9 17 2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24
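For reference, the target ordering can also be written as a plain scalar loop: reading the 24 inputs as 3 rows of 8 floats, the output lists the columns one after another, i.e. out[3*c + r] = in[8*r + c]. The sketch below is only a checking aid for a SIMD version; the function name `interleave3x8_ref` is made up for illustration and is not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the target ordering (hypothetical helper):
   treat `in` as 3 rows of 8 floats and write out the 8 columns in turn. */
void interleave3x8_ref(const float *in, float *out)
{
    assert(in != out);                    /* not safe to run in place */
    for (size_t c = 0; c < 8; c++)        /* column inside a row of 8 */
        for (size_t r = 0; r < 3; r++)    /* which of the 3 rows */
            out[3 * c + r] = in[8 * r + c];
}
```

Filling `in` with 1..24 and calling `interleave3x8_ref(in, out)` gives the sequence 1 9 17 2 10 18 ... listed above.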

Process:

The target above can be achieved with the SSE intrinsic _mm_shuffle_ps. The code is as follows:

#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_shuffle_ps, _mm_storeu_ps */

/* Same encoding as the standard _MM_SHUFFLE macro: result lanes 0 and 1 are
   lanes f0 and f1 of the first operand, result lanes 2 and 3 are lanes f2
   and f3 of the second operand. */
#define MAKE_MASK(f3,f2,f1,f0) (((f3)<<6)|((f2)<<4)|((f1)<<2)|(f0))

int main(void)
{
    float fArray[24] = {0};
    for(size_t i = 0; i < 24; i++)
        fArray[i] = (float)(i + 1);

    /* Load the 24 floats into six registers. */
    __m128 a0 = _mm_loadu_ps(fArray);
    __m128 a1 = _mm_loadu_ps(fArray+4);
    __m128 a2 = _mm_loadu_ps(fArray+8);
    __m128 a3 = _mm_loadu_ps(fArray+12);
    __m128 a4 = _mm_loadu_ps(fArray+16);
    __m128 a5 = _mm_loadu_ps(fArray+20);

    /* Round 1: (2,0,2,0) keeps lanes 0 and 2 of each register in a pair,
       (3,1,3,1) keeps lanes 1 and 3. */
    __m128 b0 = _mm_shuffle_ps(a0,a1,MAKE_MASK(2,0,2,0));
    __m128 b3 = _mm_shuffle_ps(a0,a1,MAKE_MASK(3,1,3,1));
    __m128 b1 = _mm_shuffle_ps(a2,a3,MAKE_MASK(2,0,2,0));
    __m128 b4 = _mm_shuffle_ps(a2,a3,MAKE_MASK(3,1,3,1));
    __m128 b2 = _mm_shuffle_ps(a4,a5,MAKE_MASK(2,0,2,0));
    __m128 b5 = _mm_shuffle_ps(a4,a5,MAKE_MASK(3,1,3,1));

    /* Round 2: the same pattern applied to the results of round 1. */
    __m128 c0 = _mm_shuffle_ps(b0,b1,MAKE_MASK(2,0,2,0));
    __m128 c3 = _mm_shuffle_ps(b0,b1,MAKE_MASK(3,1,3,1));
    __m128 c1 = _mm_shuffle_ps(b2,b3,MAKE_MASK(2,0,2,0));
    __m128 c4 = _mm_shuffle_ps(b2,b3,MAKE_MASK(3,1,3,1));
    __m128 c2 = _mm_shuffle_ps(b4,b5,MAKE_MASK(2,0,2,0));
    __m128 c5 = _mm_shuffle_ps(b4,b5,MAKE_MASK(3,1,3,1));

    /* Round 3: the same pattern applied to the results of round 2. */
    __m128 d0 = _mm_shuffle_ps(c0,c1,MAKE_MASK(2,0,2,0));
    __m128 d3 = _mm_shuffle_ps(c0,c1,MAKE_MASK(3,1,3,1));
    __m128 d1 = _mm_shuffle_ps(c2,c3,MAKE_MASK(2,0,2,0));
    __m128 d4 = _mm_shuffle_ps(c2,c3,MAKE_MASK(3,1,3,1));
    __m128 d2 = _mm_shuffle_ps(c4,c5,MAKE_MASK(2,0,2,0));
    __m128 d5 = _mm_shuffle_ps(c4,c5,MAKE_MASK(3,1,3,1));

    /* Store back in order d0..d5; fArray now holds 1 9 17 2 10 18 ... */
    _mm_storeu_ps(fArray,d0);
    _mm_storeu_ps(fArray+4,d1);
    _mm_storeu_ps(fArray+8,d2);
    _mm_storeu_ps(fArray+12,d3);
    _mm_storeu_ps(fArray+16,d4);
    _mm_storeu_ps(fArray+20,d5);
    return 0;
}
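One way to read the three rounds above: with MAKE_MASK(2,0,2,0) a shuffle keeps the even-indexed floats of a register pair, and with MAKE_MASK(3,1,3,1) the odd-indexed ones. Because the "even" results are numbered 0..2 and the "odd" results 3..5, each round moves all even-indexed elements of the whole array to the first half and all odd-indexed elements to the second half. The scalar sketch below models that reading; the function names are hypothetical and the code is an illustration of the idea, not a replacement for the SSE version.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One round in scalar form: even-indexed elements go to the first half of
   `out`, odd-indexed elements to the second half. `n` must be even. */
void deinterleave_round(const float *in, float *out, size_t n)
{
    assert(n % 2 == 0 && in != out);
    for (size_t i = 0; i < n / 2; i++) {
        out[i]         = in[2 * i];
        out[n / 2 + i] = in[2 * i + 1];
    }
}

/* Apply `rounds` rounds to `data`, using caller-provided `scratch`
   of the same length as temporary storage. */
void deinterleave_rounds(float *data, float *scratch, size_t n, unsigned rounds)
{
    for (unsigned r = 0; r < rounds; r++) {
        deinterleave_round(data, scratch, n);
        memcpy(data, scratch, n * sizeof *data);
    }
}
```

With 24 elements, one round gathers with stride 2, two rounds with stride 4, and three rounds with stride 8, which is the 1 9 17 2 10 18 ... ordering of the target.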

Questions

To summarize, packing 24 floats into six __m128 registers and then shuffling them three times achieves my goal. I also found that packing 16 floats into four __m128 registers and shuffling them twice gives a similar result. Is there a general principle for shuffling a float array of size 4n (n = 1, 2, 3, 4, ...)?
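To make the 24-float and 16-float cases comparable, the same even/odd pairing can be written once for any even number of registers. The sketch below is an illustration under assumptions: `shuffle_rounds` and `MAX_REGS` are made-up names, the register count must be even, and the arrays of __m128 are there for readability rather than speed. As far as I can tell, after k rounds on N floats, output element j (for j < N-1) comes from input index (2^k * j) mod (N-1) and the last element stays in place; when 2^k divides N, that is the transpose of an (N / 2^k) x 2^k row-major matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE: _mm_loadu_ps, _mm_shuffle_ps, _mm_storeu_ps */

#define MAX_REGS 16      /* arbitrary cap chosen for this sketch */

/* Apply `rounds` even/odd rounds, in place, to 4 * nregs floats.
   nregs must be even and at most MAX_REGS. */
void shuffle_rounds(float *data, size_t nregs, unsigned rounds)
{
    __m128 cur[MAX_REGS], next[MAX_REGS];
    assert(nregs % 2 == 0 && nregs <= MAX_REGS);

    for (size_t i = 0; i < nregs; i++)
        cur[i] = _mm_loadu_ps(data + 4 * i);

    for (unsigned r = 0; r < rounds; r++) {
        for (size_t p = 0; p < nregs / 2; p++) {
            /* lanes 0 and 2 of both sources: even-indexed floats of the pair */
            next[p] = _mm_shuffle_ps(cur[2 * p], cur[2 * p + 1],
                                     _MM_SHUFFLE(2, 0, 2, 0));
            /* lanes 1 and 3 of both sources: odd-indexed floats of the pair */
            next[nregs / 2 + p] = _mm_shuffle_ps(cur[2 * p], cur[2 * p + 1],
                                                 _MM_SHUFFLE(3, 1, 3, 1));
        }
        for (size_t i = 0; i < nregs; i++)
            cur[i] = next[i];
    }

    for (size_t i = 0; i < nregs; i++)
        _mm_storeu_ps(data + 4 * i, cur[i]);
}
```

Calling `shuffle_rounds(fArray, 6, 3)` on 1..24 reproduces the 24-float ordering from the question, and `shuffle_rounds(x, 4, 2)` on 1..16 gives 1 5 9 13 2 6 10 14 ..., i.e. a 4x4 transpose. For fixed sizes, fully unrolled code like that in the question is likely to be faster.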

Besides, can anyone help clarify the algorithm above or point me to relevant material?

@PaulR Seems to be, but how to relate this thread with shuffle? — Finley, Jan 04 '19 at 09:17
Well in the integer domain a transpose is usually implemented using unpack instructions rather than shuffles. I haven’t had time to give this any serious thought yet, but you might be able to use a similar approach for transposing floats. — Paul R, Jan 04 '19 at 09:21
Note that in the 3x8 float case you only need 5 shuffles, see this [question and answer](https://stackoverflow.com/questions/44984724/whats-the-fastest-stride-3-gather-instruction-sequence). For the 4x4 case there exists a handy [macro](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#=undefined&text=transpose&expand=5915). See also [here](https://stackoverflow.com/a/29587984) for other cases. — wim, Jan 04 '19 at 22:51
The 5 shuffles refer to the AVX case. With SSE you need twice as many shuffles, which is still better than 18 shuffles. — wim, Jan 04 '19 at 23:56
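For the 4x4 case mentioned in the comments, the macro in question is presumably _MM_TRANSPOSE4_PS from xmmintrin.h (an assumption, since the comment only links to a search for "transpose"). A minimal sketch of using it on 16 contiguous floats follows; the wrapper name `transpose4x4` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE: _MM_TRANSPOSE4_PS, _mm_loadu_ps, _mm_storeu_ps */

/* Transpose a 4x4 row-major float matrix in place (hypothetical wrapper). */
void transpose4x4(float *m)
{
    assert(m != NULL);
    __m128 r0 = _mm_loadu_ps(m);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);

    /* The macro rewrites its four arguments with the transposed rows. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    _mm_storeu_ps(m,      r0);
    _mm_storeu_ps(m + 4,  r1);
    _mm_storeu_ps(m + 8,  r2);
    _mm_storeu_ps(m + 12, r3);
}
```

On the input 1..16 this yields 1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16.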


0 Answers