There are separate instructions for k registers; their mnemonics all start with k so they're easy to find in the table of instructions, like kandnq k0, k0, k1. As well as kunpck... (concatenate, not interleave), there are kadd/kshift, kor/kand/knot/kxor, and even kxnor (a handy way to generate all-ones for gather/scatter). Also of course kmov (including to/from memory or a GP-integer register), and kortest and ktest for branching.
They all come in byte/word/dword/qword sizes for the number of mask bits affected, zero-extending the result. (Without AVX-512BW on a Xeon Phi, only byte and word sizes, since 16 bits covers a ZMM with elements as small as dword. But all mainstream CPUs with AVX-512 have AVX-512BW and thus 64-bit mask registers.)
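For example (arbitrary registers and label, just to illustrate a few of them):

kxnorq   k1, k1, k1        ; set a mask to all-ones, e.g. for a full gather/scatter
kmovq    rax, k2           ; mask to GP-integer, e.g. to popcnt or tzcnt a compare result
kortestq k2, k2            ; sets ZF if k2 == 0 (and CF if all-ones)
jz       .no_matches       ; branch on the mask being all-zero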
You can sometimes fold the mask combining into another operation, avoiding a separate instruction: either invert the compare so you can use ktest directly to branch, or, if you want to mask, use a zero-masked compare-into-mask. (Merge-masked compare/test into a 3rd existing mask is not supported.)
AVX-512 integer compares take a predicate as an immediate, rather than only existing as eq and gt, so you can invert the condition and use and instead of needing andn. (They're also available in signed (vpcmpb) and unsigned (vpcmpub) forms, unlike any previous x86 SIMD extension. So if you'd previously been adding 128 to flip the high bit so signed pcmpgtb could do an unsigned compare, you don't need that anymore and can just use vpcmpub.)
vpcmpngtb k1, zmm3, zmm1 ; k0 can't be used for masking, only with k instructions
vpcmpeqb k2{k1}, zmm4, zmm1 ; This is zero-masking even without {z}, because merge masking isn't supported for this
equivalent (except for performance) to:
vpcmpngtb k1, zmm3, zmm1
vpcmpeqb k2, zmm4, zmm1
kandq k2, k2, k1
Also equivalent to kandn with a gt compare as the NOTed (first) operand, like in your question.
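That version would look something like this (same arbitrary registers as the example above):

vpcmpgtb k1, zmm3, zmm1          ; the non-inverted condition
vpcmpeqb k2, zmm4, zmm1
kandnq   k2, k1, k2              ; k2 = ~k1 & k2: the first source is the one that gets NOTed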
k... mask instructions can usually only run on port 0, which is not great for throughput (https://uops.info/).
A masked compare (or other instruction) has to wait for the mask register input to be ready before starting to work on the other operands. You might hope it would support late forwarding for masks, since they're only needed at write-back, but IIRC it doesn't. Still, 1 instruction instead of 2 is better. Having the first of two instructions able to execute in parallel isn't a win unless it was high latency and the mask operation is low latency, and you're latency-bound. But often execution-unit throughput is more of a bottleneck when using 512-bit registers (since the vector ALUs on port 1 are shut down).
Some k instructions are only 1-cycle latency on current CPUs, while others (like kshift, kunpck, and kadd) are 4-cycle latency.
The intrinsics for these masked compare-into-mask instructions are _mm256_mask_cmp_ep[iu]_mask, with a __mmask8/16/32/64 input operand (as well as two vectors and an immediate predicate) and a mask return value. Like the asm, they use ..._mask_... instead of ..._maskz_... despite this being zero-masking, not merge-masking.
Applying a mask to a vector
Apparently this question wanted to use the mask with another vector, not just get a mask for vpmovmskb or something. AVX-512 has merge-masking like zmm0{k1} and zero-masking like zmm0{k1}{z} when writing to a vector destination. See the slides from Kirill Yukhin introducing a bunch of AVX-512 features and the asm syntax for them if you know AVX2 asm but don't already know the basics of AVX-512's new stuff.
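For example, with an arbitrary instruction and registers just to show the syntax:

vpaddb zmm0{k1},    zmm0, zmm1   ; merge-masking: bytes where k1 is 0 keep zmm0's old value
vpaddb zmm2{k1}{z}, zmm0, zmm1   ; zero-masking: bytes where k1 is 0 become 0 in zmm2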
;; original code
vpaddb ymm0, ymm1, ymm2
vpcmpgtb ymm0, ymm0, ymm3 ; sum > y3 (signed compare)
vpandn ymm0, ymm0, ymm4 ; masked y4
vpxor ymm0, ymm1, ymm0 ; y0 = y1^y4 in bytes where compare was false
; y0 = y1 where it was true
Using 256-bit vectors on an AVX-512 CPU, you can use vpternlogd to replace the last 2 instructions (still using an AVX2 compare-into-vector as long as you avoid ymm16..31); see the sketch below. Unfortunately AVX-512 doesn't have compare-into-vector at all, only into mask. 256-bit vectors can be a good option if your program doesn't spend a lot of its time in SIMD loops, especially on CPUs where the max-turbo penalty is higher for 512-bit vectors. (Not a huge deal with integer vectors; SIMD-integer other than multiply is "light", not "heavy".)
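A sketch of that 256-bit version (the 0xC6 truth-table constant is my own derivation for dest = b ^ (~a & c), with a = the compare result already in the destination; worth double-checking against the vpternlog immediate tables):

vpaddb     ymm0, ymm1, ymm2
vpcmpgtb   ymm0, ymm0, ymm3          ; AVX2 compare into vector: all-ones bytes where sum > y3
vpternlogd ymm0, ymm1, ymm4, 0xC6    ; ymm0 = ymm1 ^ (~ymm0 & ymm4), replacing vpandn + vpxor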
For 512-bit vectors, we have to use masks. The fully naive drop-in way would be to expand the mask back to a vector with vpmovm2b zmm0, k1 and then vpandnq/vpxorq without masking. Or vpternlogd without masking could still keep the total down to 4 instructions in this case, combining the andn/xor.
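A sketch of that vpternlogd variant, reusing the same hedged 0xC6 constant as above:

vpaddb     zmm0, zmm1, zmm2
vpcmpgtb   k1, zmm0, zmm3            ; sum > z3, signed
vpmovm2b   zmm0, k1                  ; expand the mask back to a vector of 0 / -1 bytes
vpternlogd zmm0, zmm1, zmm4, 0xC6    ; zmm0 = zmm1 ^ (~zmm0 & zmm4)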
A zero-masking vmovdqu8 zmm0{k1}{z}, zmm4 is a better way to replace vpandn. Or a blend after the xor, using the mask as the control operand. That would still be 4 instructions that all need an execution unit.
If it were possible, e.g. in a different problem with 32-bit elements (footnote 1), merge-masked XOR would be good (after copying a register unchanged so mov-elimination could work (footnote 2), if you can't destroy zmm1). But AVX-512 doesn't have byte-masking for bitwise booleans; there's only vpxord and vpxorq, which allow masking in 32- or 64-bit elements. AVX-512BW only added byte/word element-size versions of vmovdqu, and of instructions that care about element boundaries even without masking, like vpaddb and vpshufb.
Our best bet for instruction-level parallelism is to XOR in parallel with the compare, then fix up that result once the compare mask result is ready.
vpaddb zmm0, zmm1, zmm2
vpcmpgtb k1, zmm0, zmm3 ; (sum > z3) signed compare, same as yours
vpxord zmm0, zmm1, zmm4
vmovdqu8 zmm0{k1}, zmm1 ; replace with z1 in bytes where (z1+z2 > z3)
; z0 = z1^z4 in bytes where compare was false
; z0 = z1 where it was true.
The final instruction could equally have been vpblendmb zmm0{k1}, zmm0, zmm1 (see the manual), which differs from a merge-masking vmovdqu8 only in being able to write the blend result to a 3rd register.
Depending on what you're going to do with that vpxord result, you might be able to optimize further into the surrounding code, perhaps with vpternlogd if it's more bitwise booleans, or perhaps by merge-masking or zero-masking into something else, e.g. copying zmm1 and doing a merge-masked vpaddb into it instead of doing the blend.
Another, worse, way with less instruction-level parallelism is to use the same order as your AVX2 code (where the more-ILP order would have required a vpblendvb, which is more expensive):
; Worse ILP version, direct port of your AVX2 logic
vpaddb zmm0, zmm1, zmm2
vpcmpngtb k1, zmm0, zmm3 ; !(sum > z3) signed compare
vmovdqu8 zmm0{k1}{z}, zmm4 ; zmm4 or 0, like your vpandn result
vpxord zmm0, zmm0, zmm1 ; z0 = z1^z4 in bytes where compare was true
; leaving z0=z1 bytes where the mask was zero (k1[i]==0)
; this is for the inverted compare, ngt = le condition
In this version, each instruction depends on the result of the previous one, so the total latency from k1 being ready to the final zmm0 being ready is 4 cycles instead of 3. (The earlier version can run vpxord in parallel with vpcmpb, assuming ZMM4 is ready early enough, leaving only the 3-cycle merge-masking vmovdqu8 after the mask is ready.)
Zero-masking (and merge-masking) vmovdqu8 have 3-cycle latency on Skylake-X and Alder Lake (https://uops.info/), same as vpblendmb, but vmovdqu32 and vmovdqu64 have 1-cycle latency.
vpxord has 1-cycle latency even with masking, but vpaddb has 3-cycle latency with masking vs. 1 without. So it seems byte-masking is consistently 3-cycle latency, while dword/qword masking keeps the same latency as the unmasked instruction. Throughput isn't affected though, so as long as you have enough instruction-level parallelism, out-of-order exec can hide latency if it's not a long loop-carried dep chain.
Footnote 1: wider elements allow masked booleans
This is for the benefit of future readers who are using a different element size. You definitely don't want to widen your byte elements to dword if you don't have to; that would get 1/4 the work done per vector just to save 1 back-end uop via mov-elimination:
; 32-bit elements would allow masked xor
; but there is no vpxorb
vpaddd zmm0, zmm1, zmm2
vpcmpngtd k1, zmm0, zmm3 ; !(sum > z3) signed compare
;vpxord zmm1{k1}, zmm1, zmm4 ; if destroying ZMM1 is ok
vmovdqa64 zmm0, zmm1 ; if not, copy first
vpxord zmm0{k1}, zmm1, zmm4 ; z0 = z1^z4 in dwords where compare was true
; leaving z0=z1 dwords where the mask was zero (k1[i]==0)
Footnote 2: register copies and mov-elimination
vmovdqu8 zmm0, zmm1 doesn't need an execution unit. But vmovdqu8 zmm0{k1}{z}, zmm1 does, and like other 512-bit uops it can only run on port 0 or 5 on current Intel CPUs, including Ice Lake and Alder Lake-P (on systems where its AVX-512 support isn't disabled).
Ice Lake broke mov-elimination only for GP-integer, not vectors, so an exact copy of a register is still cheaper than doing any masking or other work. Having only two SIMD execution ports makes the back-end a more common bottleneck than for code using 256-bit vectors, especially on Ice Lake and later with their wider front-ends (5-wide in Ice Lake, 6-wide in Alder Lake / Sapphire Rapids). Most code has significant load/store and integer work, though.