Given four __m256i mask vectors mask0, mask1, mask2 and mask3, each with eight 32-bit elements, I would like to pack them into a single __m256i vector mask with 32 8-bit elements.
// Pseudocode: initializer lists with different element counts wouldn't really work for __m256i
// input: e.g. from _mm256_cmp_ps or _mm256_cmp_epi32
__m256i mask0 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF };
__m256i mask1 = { 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000 };
__m256i mask2 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF };
__m256i mask3 = { 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000 };
// result:
__m256i mask = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
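To pin down the intended element order, here is a minimal scalar sketch of the mapping I am after. The function name pack_masks_scalar and the flat array layout (mask0's eight elements first, then mask1's, and so on) are only assumptions for illustration. It also assumes every input element is either all-ones or all-zeros, so keeping the low byte is enough.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the desired packing.
 * in  : 4 masks x 8 elements, stored back to back (mask0 first).
 * out : 32 bytes; element i of mask v lands at byte 8*v + i.
 * Assumes each input element is 0xFFFFFFFF or 0x00000000. */
static void pack_masks_scalar(const uint32_t in[32], uint8_t out[32])
{
    for (size_t v = 0; v < 4; ++v) {
        for (size_t i = 0; i < 8; ++i) {
            out[8 * v + i] = (uint8_t)(in[8 * v + i] & 0xFFu);
        }
    }
}
```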
Any suggestions (also so that I can try to implement them) are very welcome!
EDIT: This is my solution adapted from the linked duplicate:
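// The pack instructions operate within each 128-bit lane, so after both packing steps
// the 32-bit chunks are ordered m0[0..3], m1[0..3], m2[0..3], m3[0..3], m0[4..7], m1[4..7], m2[4..7], m3[4..7].
mask = _mm256_packs_epi16(_mm256_packs_epi32(mask0, mask1), _mm256_packs_epi32(mask2, mask3));
// Reorder the 32-bit chunks across lanes so that the bytes of each input mask are contiguous.
mask = _mm256_permutevar8x32_epi32(mask, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));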
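For anyone who wants to try it, here is the same idea wrapped into a small self-contained sketch. The names pack_masks_avx2, pack_masks_to_bytes and the PACK_TARGET_AVX2 macro are made up for this example; the macro is only there so the code can be built with GCC or Clang without passing -mavx2, and an AVX2-capable CPU is still required at run time. It relies on the inputs being all-ones or all-zeros per element, since the signed saturating packs then preserve -1 and 0 exactly.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: enable AVX2 per function on GCC/Clang; elsewhere rely on compiler flags. */
#if defined(__GNUC__) || defined(__clang__)
#define PACK_TARGET_AVX2 __attribute__((target("avx2")))
#else
#define PACK_TARGET_AVX2
#endif

/* Pack four 8 x 32-bit masks into one 32 x 8-bit mask (hypothetical helper). */
PACK_TARGET_AVX2
static __m256i pack_masks_avx2(__m256i m0, __m256i m1, __m256i m2, __m256i m3)
{
    /* 32-bit -> 16-bit with signed saturation, separately in each 128-bit lane. */
    __m256i ab = _mm256_packs_epi32(m0, m1);
    __m256i cd = _mm256_packs_epi32(m2, m3);
    /* 16-bit -> 8-bit with signed saturation, again per lane. */
    __m256i abcd = _mm256_packs_epi16(ab, cd);
    /* Undo the per-lane interleaving by shuffling 32-bit chunks across lanes. */
    return _mm256_permutevar8x32_epi32(abcd,
                                       _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));
}

/* Memory-to-memory wrapper: in holds mask0..mask3 back to back, out receives 32 bytes. */
PACK_TARGET_AVX2
static void pack_masks_to_bytes(const uint32_t in[32], uint8_t out[32])
{
    __m256i m0 = _mm256_loadu_si256((const __m256i *)(in + 0));
    __m256i m1 = _mm256_loadu_si256((const __m256i *)(in + 8));
    __m256i m2 = _mm256_loadu_si256((const __m256i *)(in + 16));
    __m256i m3 = _mm256_loadu_si256((const __m256i *)(in + 24));
    _mm256_storeu_si256((__m256i *)out, pack_masks_avx2(m0, m1, m2, m3));
}
```

With this layout, element i of mask v ends up at byte 8*v + i of the output, which matches the result I described above.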