
I'm currently working on an image-processing routine dealing with high-resolution 8-bit images.
After computing eight `__mmask64` masks, I need to pack them, bitwise transposed, into one `__m512i` for further processing. I came up with the following solution:

const __m512i c_128 = _mm512_set1_epi8(128);
const __m512i c_64 = _mm512_set1_epi8(64);
const __m512i c_32 = _mm512_set1_epi8(32);
const __m512i c_16 = _mm512_set1_epi8(16);
const __m512i c_8 = _mm512_set1_epi8(8);
const __m512i c_4 = _mm512_set1_epi8(4);
const __m512i c_2 = _mm512_set1_epi8(2);
const __m512i c_1 = _mm512_set1_epi8(1);
__mmask64 m128, m64, m32, m16, m8, m4, m2, m1;
__m512i vector;

// ...
// generate the eight masks (m1 ... m128)
// ...

// Seed bit 7 from m128, then merge each remaining mask in as its bit value.
vector = _mm512_maskz_mov_epi8(m128, c_128);
vector = _mm512_mask_add_epi8(vector, m64, vector, c_64);
vector = _mm512_mask_add_epi8(vector, m32, vector, c_32);
vector = _mm512_mask_add_epi8(vector, m16, vector, c_16);
vector = _mm512_mask_add_epi8(vector, m8, vector, c_8);
vector = _mm512_mask_add_epi8(vector, m4, vector, c_4);
vector = _mm512_mask_add_epi8(vector, m2, vector, c_2);
vector = _mm512_mask_add_epi8(vector, m1, vector, c_1);

Even though it works, I don't like it:

  • eight zmm registers are occupied by dull constants
  • eight instructions for creating a single vector are too many
  • a long dependency chain (see the split sketched below)
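
In the real code I already split this into two halves to shorten the chain (this also comes up in the comments below). A rough sketch of that split - the halves fill disjoint bit positions (7..4 vs. 3..0), so one final plain add (or OR) merges them:

__m512i hi = _mm512_maskz_mov_epi8(m128, c_128);
hi = _mm512_mask_add_epi8(hi, m64, hi, c_64);
hi = _mm512_mask_add_epi8(hi, m32, hi, c_32);
hi = _mm512_mask_add_epi8(hi, m16, hi, c_16);

__m512i lo = _mm512_maskz_mov_epi8(m8, c_8);
lo = _mm512_mask_add_epi8(lo, m4, lo, c_4);
lo = _mm512_mask_add_epi8(lo, m2, lo, c_2);
lo = _mm512_mask_add_epi8(lo, m1, lo, c_1);

vector = _mm512_add_epi8(hi, lo); // bits never overlap, so add == or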

I've been looking for instructions/intrinsics that could do the above more elegantly, but AVX512 simply has so many subsets, with hundreds of instructions in total.

Could someone give me some hints on this? Even just naming some instructions/intrinsics would help me tremendously. Or did I already find the best solution?

Thanks in advance.

Jake 'Alquimista' LEE
  • Which AVX512 subsets are available? Do you have GFNI? – harold Feb 25 '22 at 15:45
  • @harold Yes, I do. And it really doesn't matter which subset those particular instructions belong to. I'd even buy a new computer for that. Even better - that would be a perfect excuse for buying a new machine. – Jake 'Alquimista' LEE Feb 25 '22 at 15:53
  • How do you compute your `__mmask64`? Can you already transpose them during generation? Otherwise, some simple advice: 1) use bitwise-or instead of addition 2) generate the upper and lower half separately to split the dependency chain. – chtz Feb 25 '22 at 16:00
  • Nevermind my suggestion 1) -- this is not supported for 64bit masks ... – chtz Feb 25 '22 at 16:08
  • @chtz it's by `cmpgt_epu8`, and I cannot transpose them - at least not without additional computation that would exceed 8 cycles. 2) I'm already doing it that way - two vectors with 4 bits filled each; I just streamlined them for this question. I thought there might be some "magic" instructions :-) – Jake 'Alquimista' LEE Feb 25 '22 at 16:19
  • Can you efficiently get these masks together in the form of an `__m512i`? Just computing masks and moving them would *work*, but I think just doing that already costs about as much as what you have. If you had the masks in that form, there would be a trick with a byte permute and a `vgf2p8affineqb` (at least I think so - I didn't work it out fully, and I think it's a waste of time to work it out if you cannot efficiently get the masks in that form in the first place). – harold Feb 25 '22 at 16:25 (see the sketch below)
  • @harold In fact, I do 16 `cmpgt_epu8` (16 surrounding pixels vs the center one) that can hardly be altered since those 16 "surrounding pixels" come from `alignr`. – Jake 'Alquimista' LEE Feb 25 '22 at 16:31
  • @harold You already helped me out greatly by pointing out GFNI. I just wonder why Intel hid the GFNI subset in the "Other" category of their Intrinsics Guide. – Jake 'Alquimista' LEE Feb 25 '22 at 16:35
  • @harold: if OoO exec can hide the latency of a store-forwarding stall, 8x `kmov` stores of the masks to an `alignas(64)` array and a 64-byte vector reload could work. That would take pressure off the ALU ports, and is OK for throughput - [What are the costs of failed store-to-load forwarding on x86?](https://stackoverflow.com/q/46135369) - an in-flight SF stall doesn't block successful store-forwarding or other stores. Maybe with some software pipelining or a two-phase cache-blocked arrangement you could give the stores enough time to commit and avoid the SF stall. – Peter Cordes Feb 26 '22 at 01:17 (see the sketch below)
  • @Jake'Alquimista'LEE: GFNI has legacy-SSE forms so they could implement it on their low-power CPUs that didn't support AVX2. (To accelerate RAID6 in NAS boxes, for example.) That makes it special, and not just an AVX512 extension. Although with GFNI + AVX1 or AVX512, there are wide versions of it. Intel's online intrinsics guide seems to mis-label the 128-bit versions as requiring AVX-512VL + GFNI, only using the `v` mnemonic. https://github.com/HJLebbink/asm-dude/wiki/GF2P8AFFINEQB shows the existence of the legacy-SSE GF2P8AFFINEQB encoding. – Peter Cordes Feb 26 '22 at 01:19
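
Putting the last two comments together (a speculative sketch, not a worked answer): Peter Cordes's spill-and-reload gets the eight masks into a single `__m512i`, one mask per qword. Something like:

// needs <immintrin.h>; alignas is C++11 (or C11 via <stdalign.h>)
alignas(64) uint64_t spill[8];
spill[0] = _cvtmask64_u64(m1);   // qword j holds the mask that feeds bit j
spill[1] = _cvtmask64_u64(m2);
spill[2] = _cvtmask64_u64(m4);
spill[3] = _cvtmask64_u64(m8);
spill[4] = _cvtmask64_u64(m16);
spill[5] = _cvtmask64_u64(m32);
spill[6] = _cvtmask64_u64(m64);
spill[7] = _cvtmask64_u64(m128);
__m512i z = _mm512_load_si512(spill);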
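
From there, my guess at the byte permute + `vgf2p8affineqb` trick harold hints at - worked out on paper only, and assuming AVX512-VBMI is available for `vpermb`:

// Gather byte B of every mask into qword B, in reversed mask order
// (idx[8*B + j] = 8*(7-j) + B) so the affine step lands bits the right way up.
alignas(64) char idx_bytes[64];
for (int B = 0; B < 8; B++)
    for (int j = 0; j < 8; j++)
        idx_bytes[8 * B + j] = (char)(8 * (7 - j) + B);
__m512i idx = _mm512_load_si512(idx_bytes);
__m512i w = _mm512_permutexvar_epi8(idx, z); // vpermb, AVX512-VBMI

// 8x8 bit transpose within each qword: with x.byte[b] = 1 << b and A = w,
// GF2P8AFFINEQB gives dst.byte[b].bit[i] = w.byte[7-i].bit[b], which together
// with the reversed permute puts bit (8B+b) of mask 2^j into bit j of byte 8B+b.
const __m512i unit = _mm512_set1_epi64(0x8040201008040201ULL);
vector = _mm512_gf2p8affine_epi64_epi8(unit, w, 0);

If the indices above are right, the whole pack becomes one `vpermb` plus one `vgf2p8affineqb` once the masks are in a vector - no constant registers and no eight-deep chain.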

0 Answers