I'm currently working on an image processing routine that deals with high-resolution 8-bit images.
After computing eight __mmask64 values, I need to pack them into one __m512i, bitwise transposed, for further processing. I came up with the following solution:
const __m512i c_128 = _mm512_set1_epi8(128);
const __m512i c_64 = _mm512_set1_epi8(64);
const __m512i c_32 = _mm512_set1_epi8(32);
const __m512i c_16 = _mm512_set1_epi8(16);
const __m512i c_8 = _mm512_set1_epi8(8);
const __m512i c_4 = _mm512_set1_epi8(4);
const __m512i c_2 = _mm512_set1_epi8(2);
const __m512i c_1 = _mm512_set1_epi8(1);
__mmask64 m128, m64, m32, m16, m8, m4, m2, m1;
__m512i vector;
// ... generate the eight masks ...
vector = _mm512_maskz_mov_epi8(m128, c_128);
vector = _mm512_mask_add_epi8(vector, m64, vector, c_64);
vector = _mm512_mask_add_epi8(vector, m32, vector, c_32);
vector = _mm512_mask_add_epi8(vector, m16, vector, c_16);
vector = _mm512_mask_add_epi8(vector, m8, vector, c_8);
vector = _mm512_mask_add_epi8(vector, m4, vector, c_4);
vector = _mm512_mask_add_epi8(vector, m2, vector, c_2);
vector = _mm512_mask_add_epi8(vector, m1, vector, c_1);
It works, but I don't like it:
- eight zmm registers are occupied by dull constants
- eight instructions to create a single vector are too many
- the eight instructions form one serial dependency chain
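To be precise about what I mean by "bitwise transposed", here is a scalar sketch of the result I expect. The function name pack_masks_ref and the array layout (masks[0] = m128 down to masks[7] = m1) are made up for this illustration and are not part of my real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for the intended packing (illustrative helper only).
   Assumed layout: masks[0] plays the role of m128 (bit 7 of every byte),
   masks[1] of m64, ..., masks[7] of m1 (bit 0 of every byte).
   Byte i of the output collects bit i of each of the eight masks. */
static void pack_masks_ref(const uint64_t masks[8], uint8_t out[64])
{
    assert(masks != NULL && out != NULL);

    for (int i = 0; i < 64; ++i) {
        uint8_t byte = 0;
        for (int k = 0; k < 8; ++k) {
            /* Extract bit i of mask k ... */
            uint8_t bit = (uint8_t)((masks[k] >> i) & 1u);
            /* ... and place it at bit position 7 - k of output byte i. */
            byte |= (uint8_t)(bit << (7 - k));
        }
        out[i] = byte;
    }
}
```

Since every mask contributes a distinct bit of each byte, the masked adds in my code never carry, so a bitwise OR would give the same result as the addition.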
I've been looking for instructions/intrinsics that could do the above more elegantly, but AVX512
simply has so many subsets, with hundreds of instructions in total.
Could someone give me some hints on this? Even just naming some instructions/intrinsics would help me tremendously. Or did I already find the best solution?
Thanks in advance.