Thanks in advance for the help. I need to perform the following shuffle pattern on an array of uint16_t data. My unprocessed array looks like this:

0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 

I have transformed my unprocessed data into the format below with _mm512_permutexvar_epi16:

0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 

and now I need to store the contents of the AVX-512 register into 4 different arrays; this is the part I'm unsure how best to do.

next eight values of arrayOfZeros: 0 0 0 0 0 0 0 0
next eight values of arrayOfOnes: 1 1 1 1 1 1 1 1
next eight values of arrayOfTwos: 2 2 2 2 2 2 2 2
next eight values of arrayOfThrees: 3 3 3 3 3 3 3 3

I need to loop through my unprocessed data and populate arrayOfZeros with all the 0 values, and likewise for the 1, 2, and 3 values. NOTE: my actual data is not hardcoded 0, 1, 2, 3. It is calculated data, and I need to put the

1st value in the 1st array, 
2nd value in the 2nd array, 
3rd value in the 3rd processed data array, 
and 4th value in the 4th processed data array

That pattern repeats for the entire unprocessed data array, so that after all processing is done (a plain-C++ version of the loop I need is shown right after this list):

1st array holds all the 0 values
2nd array holds all the 1 values
3rd array holds all the 2 values
4th array holds all the 3 values
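
In plain C++ (no intrinsics), what I need is essentially the loop below; the array names match the code further down.

for (size_t i {0}, j {0}; i < unprocessedData.size(); i += 4, ++j)
{
    // de-interleave: element 0 of each group of 4 goes to processedData0, and so on
    processedData0[j] = unprocessedData[i];
    processedData1[j] = unprocessedData[i+1];
    processedData2[j] = unprocessedData[i+2];
    processedData3[j] = unprocessedData[i+3];
}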

I have been looking at _mm512_permutexvar_epi16 to get my unprocessed data into that format.

Below is the code that I have started.

#include <immintrin.h>
#include <array>
#include <cstdint>
#include <iostream>

int main()
{
    alignas(64) std::array<uint16_t, 128> unprocessedData;
    alignas(64) std::array<uint16_t, 32> processedData0, processedData1, processedData2, processedData3; 
    alignas(64) constexpr std::array<uint16_t, 32> shuffleMask {
         0, 4, 8, 12, 16, 20, 24, 28,
         1, 5, 9, 13, 17, 21, 25, 29,
         2, 6, 10, 14, 18, 22, 26, 30,
         3, 7, 11, 15, 19, 23, 27, 31,
    };
    //prepare sample data
    for (uint16_t i {0}; i < unprocessedData.size(); i+=4)
    {
        unprocessedData[i] = 0;
        unprocessedData[i+1] = 1;
        unprocessedData[i+2] = 2;
        unprocessedData[i+3] = 3;
    } 
    for (size_t i {0}; i < unprocessedData.size(); i+=32)
    {
        // permute one 32-word chunk into grouped order: all 0s, all 1s, all 2s, all 3s
        auto v {_mm512_loadu_epi16(&unprocessedData[i])};
        _mm512_storeu_epi16(&unprocessedData[i],
                            _mm512_permutexvar_epi16(_mm512_load_si512((__m512i*)shuffleMask.data()), v));
        //Somehow Store values 0-7 of permuted array into processedData0
        //Store values 8-15 of permuted array into processedData1
        //Store values 16-23 of permuted array into processedData2
        //Store values 24-31 of permuted array into processedData3
    }
    return 0;
}
  • Are there really only 4 different possible values? Your best bet would be CountingSort (i.e. make a histogram especially on Ice Lake for fast-short-rep if that works for `rep stosw` to make N copies of a value). Or maybe 4x compare -> 4x `vpcompressw`. – Peter Cordes Jul 16 '21 at 18:06
  • Peter, no the values can be any real number. But those values will go into the 4 different bins. Perhaps it would be better to label the example UnprocessedData = a b c d e f g h i j k L processedData1 = a e i processedData2 = b f j processedData3 = c g k processedData4 = d h L – Matthew Pittenger Jul 16 '21 at 18:07
  • Ok, but there are only 4 bins, and you can check which bin a number should go in with what, one `cmpeq`? Or a `cmpgt_mask` / `mask_cmple_mask` (zero-masked compare into mask) to check that `x > low && x <= high`? Oh, `vpcompressw` is AVX512-VBMI2 (Ice Lake) :/ https://github.com/HJLebbink/asm-dude/wiki/VPCOMPRESS – Peter Cordes Jul 16 '21 at 18:10
  • I don't have to check what value the bin is in. I just thought of the more proper phrasing, sorry for it taking me longer to get there. I have 4 way interleaved data in unprocessedData and what I need to do is de-interleave that data. Below would be the plain c++ of what I need to do. `for (size_t i {0}, j {0}; i < unprocessedData.size(); i+=4, j++) { processedData1[j] = unprocessedData[i]; processedData2[j] = unprocessedData[i+1]; processedData3[j] = unprocessedData[i+2];processedData4[j] = unprocessedData[i+3];}` – Matthew Pittenger Jul 16 '21 at 18:13
  • Oh, so it's always a fixed shuffle, not value-dependent? Yeah that's trivial. Just `vpermw` like you're doing, and store separate parts of your vector separately. like `_mm_storeu_si128(a0, _mm512_castsi512_si128(v))` for the low lane, then `_mm512_extracti32x4_epi32(v, 1)` to get the 2nd 128-bit chunk, etc. Your question seemed to focus on the shuffling, not the storing. – Peter Cordes Jul 16 '21 at 18:17
  • Peter, thanks! Sorry for the confusion, I was struggling on the storing front. So that would take 10 AVX instructions per chunk, 1 load, 1 permute, 3 extracts, 4 stores? – Matthew Pittenger Jul 16 '21 at 18:24
  • 1 memory-source permute, 1 `vmovdqu` store, 3x *memory-destination* `vextracti32x4` (which doesn't involve any shuffle uops: https://uops.info/). The compiler will take care of this for you with `tmp = extract; _mm_storeu_si128(dst, tmp);` - intrinsics might look like asm but aren't asm. – Peter Cordes Jul 16 '21 at 18:40
  • Although `vpermw zmm, zmm, [mem]` doesn't micro-fuse the load, so it costs 3 uops in the front-end, same as a separate load + permute for uop counts, only saving machine-code size. (`vpermw` without memory is a 2-uop instruction, unfortunately, even on Ice Lake). Hmm, might be worth doing 2x `vpermt2w` (2p05 + 4p5) to set up for 4x 256-bit stores, but that would take 2 separate shuffle control vectors. If you had Ice Lake, it would be more efficient to use `vpermb` (1 uop for p5), because unfortunately its vpermw is still 2 uops :/ But ICL does have 2/clock store throughput. – Peter Cordes Jul 16 '21 at 18:43
  • Peter, you really know your stuff. I know there is the intel developer's guide and architecture guide, is that where you learned all of this, or are there any more digestible sources you would recommend? – Matthew Pittenger Jul 16 '21 at 18:45
  • Agner Fog has a really nice table of SIMD instructions for different kinds of data-movement in the SIMD chapter of his asm optimization guide. (https://www.agner.org/optimize/). It mostly covers SSE last I looked at it, though, so just in-lane data movement for AVX1/2. AVX1 and AVX2 added a few shuffles (and of course insert/extract including with memory operands). I already had a handle on that (and an idea of the cost in uops on Haswell/Skylake) before AVX-512 CPUs were out, so the huge amount of new AVX-512 instructions were something I could learn gradually. – Peter Cordes Jul 16 '21 at 18:53
  • Other useful sources include https://www.officedaytime.com/simd512e/, but for understanding efficiency you really need to be looking at stuff like https://uops.info/ or Agner Fog's instruction tables to see which things cost a shuffle uop or not. (And to understand intrinsics, look at how they compile to asm. [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116)). See also other links in https://stackoverflow.com/tags/x86/info. Answering SIMD questions on SO for ~5 years (and reading other people's answers) is how I learned. – Peter Cordes Jul 16 '21 at 18:56
  • These are awesome resources, thank you so much. I'm just starting to get into the AVX world and beginning to change the way I think. I'll probably be posting more questions as I come across other problems that I can't surf my way through the intrinsic guide. It's hard to know what to look for sometimes. If you post an answer I will mark it has correct and answered. – Matthew Pittenger Jul 16 '21 at 20:08
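
Putting the suggestions from the comments together, here is a minimal, untested sketch of the inner loop with the store step filled in. It assumes AVX-512F/BW and reuses shuffleMask and the arrays from the question; after the permute, each 128-bit lane of the result is written to a different destination array.

for (size_t i {0}, j {0}; i < unprocessedData.size(); i += 32, j += 8)
{
    __m512i v = _mm512_loadu_epi16(&unprocessedData[i]);
    // group the words within the vector as 0..0 1..1 2..2 3..3
    __m512i g = _mm512_permutexvar_epi16(_mm512_load_si512(shuffleMask.data()), v);

    // lane 0 -> processedData0, lane 1 -> processedData1, lane 2 -> processedData2, lane 3 -> processedData3
    _mm_storeu_si128(reinterpret_cast<__m128i*>(&processedData0[j]), _mm512_castsi512_si128(g));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(&processedData1[j]), _mm512_extracti32x4_epi32(g, 1));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(&processedData2[j]), _mm512_extracti32x4_epi32(g, 2));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(&processedData3[j]), _mm512_extracti32x4_epi32(g, 3));
}

As Peter notes above, the compiler turns the extract-then-store pairs into memory-destination vextracti32x4 instructions, which don't cost extra shuffle uops.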
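
Peter's comment about uop counts also mentions an alternative: two `vpermt2w` shuffles (`_mm512_permutex2var_epi16`) across a pair of input vectors, so each destination array gets one contiguous 256-bit store per pair. The rough sketch below is my own untested interpretation of that idea; the maskLo/maskHi index vectors are assumptions, not something given in the comments.

// Hypothetical control vectors for _mm512_permutex2var_epi16: indices 0..31 select
// from the first source, 32..63 from the second. maskLo gathers all element-0s then
// all element-1s of the 16 groups of 4; maskHi gathers the element-2s and element-3s.
alignas(64) constexpr std::array<uint16_t, 32> maskLo {
     0,  4,  8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60,
     1,  5,  9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61,
};
alignas(64) constexpr std::array<uint16_t, 32> maskHi {
     2,  6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62,
     3,  7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63,
};

for (size_t i {0}, j {0}; i < unprocessedData.size(); i += 64, j += 16)
{
    __m512i a = _mm512_loadu_epi16(&unprocessedData[i]);
    __m512i b = _mm512_loadu_epi16(&unprocessedData[i + 32]);

    // lo = all element-0s (words 0..15) then all element-1s (words 16..31)
    __m512i lo = _mm512_permutex2var_epi16(a, _mm512_load_si512(maskLo.data()), b);
    // hi = all element-2s then all element-3s
    __m512i hi = _mm512_permutex2var_epi16(a, _mm512_load_si512(maskHi.data()), b);

    _mm256_storeu_si256(reinterpret_cast<__m256i*>(&processedData0[j]), _mm512_castsi512_si256(lo));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(&processedData1[j]), _mm512_extracti64x4_epi64(lo, 1));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(&processedData2[j]), _mm512_castsi512_si256(hi));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(&processedData3[j]), _mm512_extracti64x4_epi64(hi, 1));
}

As Peter notes, this trades extra shuffle work for fewer, wider stores, and it needs two separate shuffle control vectors.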

0 Answers