Given four __m256i mask vectors mask0, mask1, mask2, mask3, each with 8 32-bit elements, I would like to pack them into a single __m256i vector mask with 32 8-bit elements.

// Pseudocode: these initializer lists with different lengths wouldn't really compile
// input: e.g. from _mm256_cmp_ps or _mm256_cmp_epi32
__m256i mask0 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF };
__m256i mask1 = { 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000 };
__m256i mask2 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF };
__m256i mask3 = { 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000 };

// result:
__m256i mask = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };

Any suggestions (also so that I can try to implement them) are very welcome!


EDIT: This is my solution adapted from the linked duplicate:

mask = _mm256_packs_epi16(_mm256_packs_epi32(mask0, mask1), _mm256_packs_epi32(mask2, mask3));
mask = _mm256_permutevar8x32_epi32(mask, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
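
For readers who want to try this, here is a minimal self-contained sketch of the same two-step pack plus lane fix-up. The helper names `pack4_masks_epi32_to_epi8` and `pack4_masks_demo` are made up for illustration, and the `__attribute__((target("avx2")))` annotation assumes GCC or Clang (with another compiler, drop it and enable AVX2 through the usual build flag). Signed saturation keeps 0 as 0 and -1 as -1 at each narrowing step. The 256-bit pack instructions operate within each 128-bit lane, which is why the final dword permute is needed to restore the mask0, mask1, mask2, mask3 order.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: packs four vectors of 8 x 32-bit masks (0 / -1)
   into one vector of 32 x 8-bit masks (0x00 / 0xFF), in input order. */
__attribute__((target("avx2")))
static __m256i pack4_masks_epi32_to_epi8(__m256i m0, __m256i m1,
                                         __m256i m2, __m256i m3)
{
    /* 32-bit -> 16-bit with signed saturation; 0 and -1 are preserved. */
    __m256i lo = _mm256_packs_epi32(m0, m1);
    __m256i hi = _mm256_packs_epi32(m2, m3);

    /* 16-bit -> 8-bit. Because packing is per 128-bit lane, the dwords now hold:
       m0[0..3], m1[0..3], m2[0..3], m3[0..3] | m0[4..7], m1[4..7], m2[4..7], m3[4..7] */
    __m256i packed = _mm256_packs_epi16(lo, hi);

    /* Interleave the dwords of the two lanes to get m0, m1, m2, m3 in order. */
    return _mm256_permutevar8x32_epi32(
        packed, _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));
}

/* Hypothetical demo: runs the example from the question and stores
   the 32 result bytes into out. */
__attribute__((target("avx2")))
static void pack4_masks_demo(uint8_t out[32])
{
    __m256i ones = _mm256_set1_epi32(-1);   /* all elements 0xFFFFFFFF */
    __m256i zero = _mm256_setzero_si256();  /* all elements 0x00000000 */
    __m256i mask = pack4_masks_epi32_to_epi8(ones, zero, ones, zero);
    _mm256_storeu_si256((__m256i *)out, mask);
}
```

Calling `pack4_masks_demo` on a 32-byte buffer on an AVX2-capable CPU should fill it with the byte pattern shown under "result" above: eight 0xFF bytes, eight 0x00 bytes, then the same again.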
simonlet
  • See _mm256_packs_epi32/epi16 and _mm256_permute4x64_epi64. – Andrey Semashev Sep 10 '21 at 17:32
  • Use `_mm256_setr_epi32` if you want to use 32-bit elements as initializers. `__m256i` in GNU C is defined as `typedef long long __m256i __attribute((vector_size(32), may_alias));`, so the elements are 64-bit, and your initializer lists are too long. – Peter Cordes Sep 10 '21 at 23:39
  • Also, you know `0xFFFF` only has the low 16 bits set, right? Each hex digit is 4 bits, not eight. So it's not something you could get from `_mm256_cmp_ps`, and it's not what I'd call a 32-bit "mask" because it's not -1 / 0. That raises the question of what packing it means, exactly. Do you really need to get a result that has `0x0F` in the non-zero elements, or would packing with unsigned saturation to `0xFF` / `0x00` be ok? Also, are you limited to AVX1, or do you also have AVX2? – Peter Cordes Sep 10 '21 at 23:40
  • @PeterCordes My bad, of course, it has to be 0xFFFFFFFF / 0x0 in the initializers and 0xFF / 0x0 in the result. I already edited the question. AVX2 is fine too. – simonlet Sep 12 '21 at 10:31
  • Ok, then yeah just standard `_mm256_packs_epi32` / `_mm256_packs_epi16` with signed saturation will preserve the 0 / -1 values, and fixup for lane crossing. – Peter Cordes Sep 12 '21 at 10:47
  • [Searching for the key using SIMD](https://stackoverflow.com/q/67227171) includes that sequence of packs. It's not an ideal duplicate, but that's one of the major pieces of its first code blocks. Or [How to convert 32-bit float to 8-bit signed char?](https://stackoverflow.com/q/51778721) also has those shuffles, after FP->int conversion. – Peter Cordes Sep 12 '21 at 10:52
  • Actually, it is still not the result that I want. In the input I have 4 registers with 8 values each, and I want those 32 values to be in my result. Currently my post says that I want only 16 8-bit values in my result, which is wrong, because I want 32 8-bit values in my result. I will correct that now. – simonlet Sep 12 '21 at 10:53
  • 4:1 packing, giving 32x 8-bit elements, is what I've been talking about. Hopefully you just meant that the question didn't accurately state what you wanted, not talking about my comments. – Peter Cordes Sep 12 '21 at 10:58
  • Correct, I was not talking about your comments, but about my question not correctly stating what I need. The duplicate answers my question correctly. A combination of `_mm256_packs_epi32` / `_mm256_packs_epi16` and a final `_mm256_permutevar8x32_epi32(mask, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7))` gave me the result that I needed. Thank you! – simonlet Sep 12 '21 at 11:18
