There are separate instructions for k registers; their mnemonics all start with k so they're easy to find in the table of instructions, like kandnq k0, k0, k1. As well as kunpck... (concatenate, not interleave), there are kadd/kshift, kor/kand/knot/kxor, and even kxnor (a handy way to generate all-ones for gather/scatter). Also of course kmov (including to/from memory or a GP-integer register), and kortest and ktest for branching.
They all come in byte/word/dword/qword sizes for the number of mask bits affected, zero-extending the result. (Without AVX-512BW on a Xeon Phi, only byte and word sizes, since 16 bits covers a ZMM with elements as small as dword. But all mainstream CPUs with AVX-512 have AVX-512BW and thus 64-bit mask registers.)
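For example (arbitrary registers and label, just to illustrate a few of them):

kxnorq   k1, k1, k1        ; set a mask to all-ones, e.g. for a full gather/scatter
kmovq    rax, k2           ; mask to GP-integer, e.g. to popcnt or tzcnt a compare result
kortestq k2, k2            ; sets ZF if k2 == 0 (and CF if all-ones)
jz       .no_matches       ; branch on the mask being all-zero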
You can sometimes fold the mask combining into another operation, avoiding a separate instruction: either invert the compare so you can use ktest directly to branch, or, if you want to mask, use a zero-masked compare-into-mask. (Merge-masked compare/test into a 3rd existing mask is not supported.)
AVX-512 integer compares take a predicate as an immediate, rather than only existing as eq and gt, so you can invert the condition and use and instead of needing andn. (They're also available in signed (vpcmpb) and unsigned (vpcmpub) forms, unlike any previous x86 SIMD extension. So if you'd previously been adding 128 to flip the high bit so signed pcmpgtb could do an unsigned compare, you don't need that anymore and can just use vpcmpub.)
vpcmpngtb k1, zmm3, zmm1 ; k0 can't be used for masking, only with k instructions
vpcmpeqb k2{k1}, zmm4, zmm1 ; This is zero-masking even without {z}, because merge masking isn't supported for this
equivalent (except for performance) to:
vpcmpngtb k1, zmm3, zmm1
vpcmpeqb k2, zmm4, zmm1
kandq k2, k2, k1
Also equivalent to kandn with a gt compare as the NOTed (first) operand, like in your question.
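That version would look something like this (same arbitrary registers as the example above):

vpcmpgtb k1, zmm3, zmm1          ; the non-inverted condition
vpcmpeqb k2, zmm4, zmm1
kandnq   k2, k1, k2              ; k2 = ~k1 & k2: the first source is the one that gets NOTed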
k... mask instructions can usually only run on port 0, which is not great for throughput (https://uops.info/).
A masked compare (or other instruction) has to wait for the mask register input to be ready before starting to work on the other operands. You might hope it would support late forwarding for masks, since they're only needed at write-back, but IIRC it doesn't. Still, 1 instruction instead of 2 is better. Having the first of two instructions able to execute in parallel isn't a win unless it was high latency and the mask operation is low latency, and you're latency-bound. But often execution-unit throughput is more of a bottleneck when using 512-bit registers (since the vector ALUs on port 1 are shut down).
Some k instructions are only 1-cycle latency on current CPUs, while others (like kshift, kunpck, and kadd) are 4-cycle latency.
The intrinsics for these masked compare-into-mask instructions are _mm256_mask_cmp_ep[iu]_mask, with a __mmask8/16/32/64 input operand (as well as two vectors and an immediate predicate) and a mask return value. Like the asm, they use ..._mask_... instead of ..._maskz_... despite this being zero-masking, not merge-masking.
Applying a mask to a vector
Apparently this question wanted to use the mask with another vector, not just get a mask for vpmovmskb or something. AVX-512 has merge-masking like zmm0{k1} and zero-masking like zmm0{k1}{z} when writing to a vector destination. See the slides from Kirill Yukhin introducing a bunch of AVX-512 features and the asm syntax for them if you know AVX2 asm but don't already know the basics of AVX-512's new stuff.
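For example, with an arbitrary instruction and registers just to show the syntax:

vpaddb zmm0{k1},    zmm0, zmm1   ; merge-masking: bytes where k1 is 0 keep zmm0's old value
vpaddb zmm2{k1}{z}, zmm0, zmm1   ; zero-masking: bytes where k1 is 0 become 0 in zmm2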
;; original code
vpaddb ymm0, ymm1, ymm2
vpcmpgtb ymm0, ymm0, ymm3 ; sum > y3 (signed compare)
vpandn ymm0, ymm0, ymm4 ; masked y4
vpxor ymm0, ymm1, ymm0 ; y0 = y1^y4 in bytes where compare was false
; y0 = y1 where it was true
Using 256-bit vectors on an AVX-512 CPU, you can use vpternlogd to replace the last 2 instructions (still using an AVX2 compare-into-vector as long as you avoid ymm16..31); see the sketch below. Unfortunately AVX-512 doesn't have compare-into-vector at all, only into mask. 256-bit vectors can be a good option if your program doesn't spend a lot of its time in SIMD loops, especially on CPUs where the max-turbo penalty is higher for 512-bit vectors. (Not a huge deal with integer vectors; SIMD-integer other than multiply is "light", not "heavy".)
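A sketch of that 256-bit version (the 0xC6 truth-table constant is my own derivation for dest = b ^ (~a & c), with a = the compare result already in the destination; worth double-checking against the vpternlog immediate tables):

vpaddb     ymm0, ymm1, ymm2
vpcmpgtb   ymm0, ymm0, ymm3          ; AVX2 compare into vector: all-ones bytes where sum > y3
vpternlogd ymm0, ymm1, ymm4, 0xC6    ; ymm0 = ymm1 ^ (~ymm0 & ymm4), replacing vpandn + vpxor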
For 512-bit vectors, we have to use masks. The fully naive drop-in way would be to expand the mask back to a vector with vpmovm2b zmm0, k1 and then vpandnq/vpxorq without masking. Or vpternlogd without masking could still keep the total down to 4 instructions in this case, combining the andn/xor.
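A sketch of that vpternlogd variant, reusing the same hedged 0xC6 constant as above:

vpaddb     zmm0, zmm1, zmm2
vpcmpgtb   k1, zmm0, zmm3            ; sum > z3, signed
vpmovm2b   zmm0, k1                  ; expand the mask back to a vector of 0 / -1 bytes
vpternlogd zmm0, zmm1, zmm4, 0xC6    ; zmm0 = zmm1 ^ (~zmm0 & zmm4)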
A zero-masking vmovdqu8 zmm0{k1}{z}, zmm4 is a better way to replace vpandn. Or a blend after the xor, using the mask as the control operand. That would still be 4 instructions that all need an execution unit.
If it were possible, e.g. in a different problem with 32-bit elements (footnote 1), merge-masked XOR would be good (after copying a register unchanged so mov-elimination could work (footnote 2), if you can't destroy zmm1). But AVX-512 doesn't have byte-masking for bitwise booleans; there's only vpxord and vpxorq, which allow masking in 32- or 64-bit elements. AVX-512BW only added byte/word element-size versions of vmovdqu, and of instructions that care about element boundaries even without masking, like vpaddb and vpshufb.
Our best bet for instruction-level parallelism is to XOR in parallel with the compare, then fix up that result once the compare mask result is ready.
vpaddb zmm0, zmm1, zmm2
vpcmpgtb k1, zmm0, zmm3 ; (sum > z3) signed compare, same as yours
vpxord zmm0, zmm1, zmm4
vmovdqu8 zmm0{k1}, zmm1 ; replace with z1 in bytes where (z1+z2 > z3)
; z0 = z1^z4 in bytes where compare was false
; z0 = z1 where it was true.
The final instruction could equally have been vpblendmb zmm0{k1}, zmm0, zmm1 (see the manual), which differs from a merge-masking vmovdqu8 only in being able to write the blend result to a 3rd register.
Depending on what you're going to do with that vpxord result, you might be able to optimize further into the surrounding code, perhaps with vpternlogd if it's more bitwise booleans, or perhaps by merge-masking or zero-masking into something else, e.g. copying zmm1 and doing a merge-masked vpaddb into it instead of doing the blend.
Another, worse, way with less instruction-level parallelism is to use the same order as your AVX2 code (where the more-ILP order would have required a vpblendvb, which is more expensive):
; Worse ILP version, direct port of your AVX2 logic
vpaddb zmm0, zmm1, zmm2
vpcmpngtb k1, zmm0, zmm3 ; !(sum > z3) signed compare
vmovdqu8 zmm0{k1}{z}, zmm4 ; zmm4 or 0, like your vpandn result
vpxord zmm0, zmm0, zmm1 ; z0 = z1^z4 in bytes where compare was true
; leaving z0=z1 bytes where the mask was zero (k1[i]==0)
; this is for the inverted compare, ngt = le condition
In this version, each instruction depends on the result of the previous one, so the total latency from k1 being ready to the final zmm0 being ready is 4 cycles instead of 3. (The earlier version can run vpxord in parallel with vpcmpb, assuming ZMM4 is ready early enough, leaving only the 3-cycle merge-masking vmovdqu8 after the mask is ready.)
Zero-masking (and merge-masking) vmovdqu8 have 3-cycle latency on Skylake-X and Alder Lake (https://uops.info/), same as vpblendmb, but vmovdqu32 and vmovdqu64 have 1-cycle latency.
vpxord has 1-cycle latency even with masking, but vpaddb has 3-cycle latency with masking vs. 1 without. So it seems byte-masking is consistently 3-cycle latency, while dword/qword masking keeps the same latency as the unmasked instruction. Throughput isn't affected though, so as long as you have enough instruction-level parallelism, out-of-order exec can hide latency if it's not a long loop-carried dep chain.
Footnote 1: wider elements allow masked booleans
This is for the benefit of future readers who are using a different element size. You definitely don't want to widen your byte elements to dword if you don't have to; that would get 1/4 the work done per vector just to save 1 back-end uop via mov-elimination:
; 32-bit elements would allow masked xor
; but there is no vpxorb
vpaddd zmm0, zmm1, zmm2
vpcmpngtd k1, zmm0, zmm3 ; !(sum > z3) signed compare
;vpxord zmm1{k1}, zmm1, zmm4 ; if destroying ZMM1 is ok
vmovdqa64 zmm0, zmm1 ; if not, copy first
vpxord zmm0{k1}, zmm1, zmm4 ; z0 = z1^z4 in dwords where compare was true
; leaving z0=z1 dwords where the mask was zero (k1[i]==0)
Footnote 2: register copies and mov-elimination
vmovdqu8 zmm0, zmm1 doesn't need an execution unit. But vmovdqu8 zmm0{k1}{z}, zmm1 does, and like other 512-bit uops it can only run on port 0 or 5 on current Intel CPUs, including Ice Lake and Alder Lake-P (on systems where its AVX-512 support isn't disabled).
Ice Lake broke mov-elimination only for GP-integer, not vectors, so an exact copy of a register is still cheaper than doing any masking or other work. Having only two SIMD execution ports makes the back-end a more common bottleneck than for code using 256-bit vectors, especially on Ice Lake and later with their wider front-ends (5-wide in Ice Lake, 6-wide in Alder Lake / Sapphire Rapids). Most code has significant load/store and integer work, though.