
Intel's intrinsics guide lists a number of intrinsics for the AVX-512 K* mask instructions, but there seem to be a few missing:

  • KSHIFT{L/R}
  • KADD
  • KTEST

The Intel developer manual claims that intrinsics are not necessary as they are auto-generated by the compiler. How does one do that, though? If it means that __mmask* types can be treated as regular integers, it would make a lot of sense, but testing something like `mask << 4` seems to cause the compiler to move the mask to a regular register, shift it, then move it back to a mask register. This was tested using Godbolt's latest GCC and ICC with -O2 -mavx512bw.
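For reference, something like the following reproduces it (the function and names are just for illustration):

```c
#include <immintrin.h>

// The compare produces a mask in a k register; the shift is where the
// round-trip shows up.
__m512i shift_mask_example(__m512i a, __m512i b) {
    __mmask64 m = _mm512_cmpeq_epi8_mask(a, b);  // mask lives in a k register
    m <<= 4;                                     // hoping for a single kshiftlq
    return _mm512_maskz_mov_epi8(m, a);          // mask consumed by a k-reg op
}
```

Instead of one `kshiftlq`, both compilers emit a `kmovq` to a GPR, a shift there, and a `kmovq` back.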

It's also interesting to note that the intrinsics only deal with __mmask16 and not the other types. I haven't tested much, but it looks like ICC doesn't mind taking an incorrect type, whereas GCC does seem to enforce that there are only 16 bits in the mask if you use the intrinsics.

Am I simply overlooking the correct intrinsics for the above instructions and the other __mmask* type variants, or is there another way to achieve the same thing without resorting to inline assembly?

  • Note that mask instructions can only run on one ALU port on Skylake-avx512. I'm not sure which port, but it's one of the ports that conflicts with vector instructions. (`kmov` to/from integer registers probably also uses that port, so moving to integer and back for a single shift is still dumb for throughput, if not latency). – Peter Cordes Aug 23 '17 at 13:59
  • At least for `ktest`/`jcc`, moving to an integer register instead of using `ktest` allows macro-fusion of the `test/jcc` for `-march=skylake-AVX512`. It's just dumb for `-march=knl`. – Peter Cordes Aug 23 '17 at 14:00
  • Out of interest, is achieving fusion worth it for the extra KMOV needed? That is, `ktest+jcc` vs `kmov+test/jcc`? – zinga Aug 24 '17 at 21:54
  • It's probably at least break even for front-end issue throughput, but worse for code-size. `ktest` + `jcc` is either 2 or 3 uops. Hopefully `ktest` is only 1, but SSE/AVX `ptest` is 2 uops (1 for the test, 1 for moving the result from vector domain to integer, same port as `movd`). `kmov` + `test/jcc` is almost certainly only 2 total uops. – Peter Cordes Aug 24 '17 at 23:40

1 Answer


Intel's documentation saying that intrinsics are "not necessary as they are auto generated by the compiler" is in fact correct. And yet, it's unsatisfying.

But to understand why it is the way it is, you need to look at the history of AVX512. None of this information is official, but it's strongly implied by the evidence.


The reason the mask intrinsics got into the mess they are in now is probably that AVX512 got "rolled out" in multiple phases without sufficient forward planning for the next phase.

Phase 1: Knights Landing

Knights Landing added 512-bit registers which only have 32-bit and 64-bit data granularity. Therefore the mask registers never needed to be wider than 16 bits.

When Intel was designing this first set of AVX512 intrinsics, they went ahead and added intrinsics for almost everything - including the mask registers. This is why the mask intrinsics that do exist are only 16 bits, and why they only cover the instructions that exist in Knights Landing. (Though I can't explain why KSHIFT is missing.)
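For instance (my sketch, using intrinsics that do exist in that first round), everything is __mmask16-in, __mmask16-out:

```c
#include <immintrin.h>

// Knights Landing era: mask intrinsics exist, but only in __mmask16 form.
__mmask16 knl_mask_ops(__mmask16 a, __mmask16 b) {
    __mmask16 x = _mm512_kand(a, b);   // KANDW
    __mmask16 y = _mm512_kxnor(a, b);  // KXNORW
    return _mm512_kor(x, y);          // KORW
}
```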

On Knights Landing, mask operations were fast (2 cycles), but moving data between mask registers and general registers was really slow (5 cycles). So it mattered where the mask operations were done, and it made sense to give the user finer-grained control over moving data back and forth between mask registers and GPRs.
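The old API does expose that control: there are explicit intrinsics for the k <-> GPR transfers, so the programmer decides when the expensive moves happen. A minimal sketch:

```c
#include <immintrin.h>

// Explicit moves between a mask register and a GPR (KMOVW both ways).
int       mask_to_gpr(__mmask16 m)   { return _mm512_mask2int(m); }
__mmask16 gpr_to_mask(int bits)      { return _mm512_int2mask(bits); }
```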

Phase 2: Skylake Purley

Skylake Purley extends AVX512 to cover byte-granular lanes, which increased the width of the mask registers to the full 64 bits. This second round also added KADD and KTEST, which didn't exist in Knights Landing.

These new mask instructions (KADD, KTEST, and 64-bit extensions of existing ones) are the ones that are missing their intrinsic counterparts.


While we don't know exactly why they are missing, there is some strong evidence pointing to the likely reasons:

Compiler/Syntax:

On Knights Landing, the same mask intrinsics were used for both 8-bit and 16-bit masks. There was no way to distinguish between them. Extending them to 32-bit and 64-bit would have made the mess worse. In other words, Intel didn't design the mask intrinsics correctly to begin with, and they decided to drop them completely rather than fix them.
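To illustrate the ambiguity (my example, not Intel's): an 8-bit mask has to be funneled through the 16-bit intrinsic, and nothing in the types stops you from mixing widths by accident:

```c
#include <immintrin.h>

// The old intrinsic is __mmask16-only, so an __mmask8 AND needs casts.
__mmask8 and_mask8(__mmask8 a, __mmask8 b) {
    return (__mmask8)_mm512_kand((__mmask16)a, (__mmask16)b);
}
```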

Performance Inconsistencies:

Bit-crossing mask instructions on Skylake Purley are slow. While all bit-wise mask instructions are single-cycle, KADD, KSHIFT, KUNPACK, etc. are all 4 cycles. Yet moving between a mask register and a GPR is only 2 cycles.

Because of this, it's often faster to move masks into GPRs, do the work there, and move them back. But the programmer is unlikely to know this. So rather than giving the user full control of the mask registers, Intel opted to just have the compiler make this decision.

Having the compiler make this decision means that the compiler needs to have such logic. The Intel Compiler currently does, as it will generate kadd and family in certain (rare) cases. But GCC does not: on GCC, all but the most trivial mask operations will be moved to GPRs and done there instead.
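As an illustrative sketch (my code, with the latencies described above), this is the pattern the compiler prefers:

```c
#include <immintrin.h>
#include <stdint.h>

// Several mask operations in a row: GCC moves the mask out with KMOVQ,
// does the work with 1-cycle integer instructions (2-4/cycle throughput),
// and moves the result back, instead of chaining 4-cycle k instructions.
__m512i rotate_mask_blend(__m512i v, __m512i w) {
    uint64_t m = _mm512_cmpeq_epi8_mask(v, w);      // mask starts in a k register
    m = (m << 8) | (m >> 56);                       // rotate, done in a GPR
    return _mm512_maskz_mov_epi8((__mmask64)m, v);  // moved back for the masked op
}
```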


Final Thoughts:

Prior to the release of Skylake Purley, I personally had a lot of AVX512 code written up, including a lot of AVX512 mask code. It was written with certain performance assumptions (single-cycle latency) that turned out to be false on Skylake Purley.

From my own testing on Skylake X, some of my mask-intrinsic code which relied on bit-crossing operations turned out to be slower than the compiler-generated versions that moved them to GPRs and back. The reason, of course, is that KADD and KSHIFT were 4 cycles instead of 1.

Of course, I would prefer that Intel provide the intrinsics and give us the control we want. But it's very easy to go wrong here (in terms of performance) if you don't know what you're doing.


Update:

It's unclear when this happened, but the latest version of the Intel Intrinsics Guide has a new set of mask intrinsics with a new naming convention that covers all the instructions and widths. These new intrinsics supersede the old ones.

So this solves the entire problem. Though the extent of compiler support is still uncertain.

Examples:

  • _kadd_mask64()
  • _kshiftri_mask32()
  • _cvtmask16_u32() supersedes _mm512_mask2int()
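A quick sketch of the new-style intrinsics in use (compiler support per the comments below: GCC 7, ICC 18, Clang 8):

```c
#include <immintrin.h>

// The width is now part of the intrinsic name, for all of 8/16/32/64 bits.
__mmask64 add64(__mmask64 a, __mmask64 b) { return _kadd_mask64(a, b); }     // KADDQ
__mmask32 shr32(__mmask32 m)              { return _kshiftri_mask32(m, 4); } // KSHIFTRD
unsigned  low16(__mmask16 m)              { return _cvtmask16_u32(m); }      // KMOVW
```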
  • Thank you for your very detailed answer! Mask operations being *that* slow is indeed rather surprising, and explains the behaviour. I don't have an actual CPU to test on, but if moves between mask/GPRs cost 2 cycles, wouldn't a KSHIFT (4 cycles) still be faster than a move+shift+move (5 cycles), not to mention relieving pressure on the frontend? Also `~mask` seems to cause a kmov+not+kmov sequence on both GCC and ICC, so it seems like intrinsics are the only way to deal with this properly...? – zinga Jul 19 '17 at 00:03
  • If you're doing only one mask instruction, then yes it's cheaper to just do it with a mask instruction. These are the cases where I can occasionally get ICC to generate them. But if you're doing anything like `KADD`, `KSHIFT`, `KUNPACK`, chances are you're probably doing multiple mask instructions. It doesn't take much before it becomes cheaper to go to GPR and back. And also note that mask instructions only have 1/cycle throughput whereas GPR integer instructions generally are 2-4/cycle. – Mysticial Jul 19 '17 at 01:12
  • As far as compilers not generating the optimal sequence, AVX512 is still new and the optimizers are still immature with respect to them. So in the end, if you want full control, you need inline assembly. And even then, there are certain bugs in ICC that make this less useful. – Mysticial Jul 19 '17 at 01:13
  • Fair enough, let's hope that the situation improves then. Thanks again for the explanation! – zinga Jul 19 '17 at 02:43
  • Which port do mask instructions run on on SKL-X? It's port 0 or 5, isn't it? And that includes `kmov` to/from GPR, right? So a single kshift has half the impact on vector instruction throughput as kmov to GPR and back, if that's correct. (`shl` can run on port 6.) It's also 3x more front-end uops. But if you're doing multiple things with mask data, then yeah moving to GPRs should be much better on SKL-X. – Peter Cordes Aug 23 '17 at 14:09
  • @PeterCordes I don't know. I haven't tested it and Agner hasn't released numbers yet. – Mysticial Aug 23 '17 at 18:16
  • @Mysticial: If you get a chance, you could check ports without needing perf counters by checking for resource conflicts with other instructions that run on known ports. e.g. check for p5 with shuffle + kshift throughput. p1 with `imul` + kshift throughput. p0 with `movd eax, xmm0` or `pmovmskb` + kshift throughput. (Or I guess with 512b instructions shutting down p1 for vector ops, lots of things run only on p0, like `pmullw`.) – Peter Cordes Aug 23 '17 at 18:53
  • @PeterCordes Looks like someone beat Agner to it: https://github.com/InstLatx64/InstLatx64/blob/master/AVX512_SKX_PortAssign_v10_PUB.ods – Mysticial Sep 13 '17 at 20:59
  • @Mysticial: oh, looks like they got data from IACA. That makes sense, I forgot that it had uop->port data. It says most `k` instructions run on P0, but `kshift` and `kunpck` run on p5. `ktest` and `kortest` are single-uop, unlike SSE/AVX PTEST. `kmov* r32, k` is P0, `kmov* k, r32` is P5. So unfortunately P1 isn't used by `k` instructions :/ – Peter Cordes Sep 13 '17 at 22:07
  • @PeterCordes The latest version of the Intel intrinsics guide has a new set of mask intrinsics that cover everything. It has a new naming convention that supersedes the old ones! – Mysticial Apr 01 '19 at 16:47
  • @PeterCordes Looks like it's GCC 7, ICC 18, and Clang 8. No support in MSVC yet. – Mysticial Apr 01 '19 at 17:01
  • Interesting; I looked at GCC's headers and it really does have builtins like `__builtin_ia32_kaddsi`, not just wrappers around the `+` operator which would have let code compile without giving the desired functionality. – Peter Cordes Apr 01 '19 at 22:40
  • @PeterCordes The mask types themselves are typedef'ed to normal integers. So you can use them like normal integers anyway. In the end, the compiler ends up choosing whether to use GPRs or mask regs. And they almost always prefer GPRs since they're faster. It's really hard to get any compiler to use mask regs for arithmetic without the intrinsics. – Mysticial Apr 01 '19 at 23:39
  • My point was that I wondered if GCC really had support in place for getting `kadd` instructions emitted, or just for compiling code that uses the `_kadd_mask64` without actually doing any different code-gen. (Sort of like the placeholder support for `_addcarryx_u32` which never uses `adox` for a 2nd dep chain.) So it's cool that GCC and clang did add builtins to back up these intrinsics. Heh, just looked at code-gen for functions that take mask args and return a mask: they're passed in integer regs (of course), and clang goes nuts https://godbolt.org/z/gCs0kI bouncing to xmm0 for one size! – Peter Cordes Apr 02 '19 at 01:14