
I've noticed several instances of clang disregarding the documented instruction mappings of masked AVX-512 intrinsics and substituting slower instruction sequences. This really undermines the expectation of programmer control; otherwise, why bother using intrinsics?

Here's an egregious example I've encountered (godbolt), where clang's output led to a 3x slowdown compared to gcc's. I was expecting this:

avx512_low_insert:
        vptestnmq       %zmm0, %zmm0, %k0
        movl    $1, %eax
        kmovb   %eax, %k2
        knotb   %k0, %k1
        kaddb   %k2, %k1, %k1
        kandb   %k1, %k0, %k1
        vpbroadcastq    %rdi, %zmm0 {%k1}

but with clang 16.x (the current release at the time of writing) we instead obtain the much more expensive:

avx512_low_insert:
        vptestmq        %zmm0, %zmm0, %k0
        movb    $1, %al
        kmovd   %eax, %k1
        kaddb   %k1, %k0, %k1
        vptestnmq       %zmm0, %zmm0, %k1 {%k1}
        vpbroadcastq    %rdi, %zmm0 {%k1}

Clang is essentially disregarding the intrinsics specified and substituting its own, inferior, ideas.

Short of hand-rolling inline asm, is there any way I can persuade it otherwise?

  • Generally no, clang treats intrinsics as inputs to its optimizer, in the same way `+` is an addition operator, not an intrinsic for `add`. And clang is very aggressive about it, especially with shuffles and blends (for SSE and AVX1/2, not just AVX-512). This can be a good thing when it has a better idea and finds tricks you missed, but can be quite bad when there are missed optimizations in its optimizer, or it's not considering the bottleneck you're optimizing for. E.g. sometimes the best choice for how to do something depends on back-end port pressure from surrounding code. – Peter Cordes Jun 10 '23 at 03:31
  • Understood. The reason makes sense even if I'm not happy with the result! Hand-rolled inline asm it shall be then, for code on the fast path. Thanks @PeterCordes. – inopinatus Jun 10 '23 at 03:34
  • Or compile hand-optimized loops with GCC maybe? BTW, the clang code is fewer uops for ports 0 and 5; it'll have better throughput on SKX. `k` instructions are *not* cheap; they can only run on one of p0 or p5 (https://uops.info/). Ones like `kadd` that aren't vertical bitwise have 4-cycle latency, in case you're using them in a long dep chain. But even better in this case is to `kmov` to an integer register and back for `blsi`: https://godbolt.org/z/Y4bbecYzc (GCC's asm looks good: `vptestnmq` / `kmovd` / `blsi` / `kmovb` / masked broadcast, so only 4 uops that need p0/p5, the SIMD ALU ports). – Peter Cordes Jun 10 '23 at 03:44
  • Using https://uica.uops.info/ to count uops for ports (never mind the overall "dependency" bottleneck it predicts, since we aren't running this in a loop), we see your GCC version as 2p0 + 4p5 plus a mov-immediate. Clang uses 1 fewer uop, but apart from the mov-immediate all 5 of them are for p5. So it's actually a win if the surrounding code doesn't also bottleneck on port 5. (Unless it inlines differently into loops?) My version is 1p0 + 3p5, plus `blsi` which can run on port 1 (even when the vector ALUs on port 1 are shut down because of 512-bit uops in flight). – Peter Cordes Jun 10 '23 at 04:02
  • uiCA predicts a very short 4c critical-path latency for GCC's version, but I'm not sure that's accurate. The compiler-generated code isn't intended to be in a loop, but the merge result is in ZMM0, so the critical path from input to output would look like a loop-carried dependency. Depending on your use-case, latency might be important, but clang didn't optimize for it, at least in the non-looping case. – Peter Cordes Jun 10 '23 at 04:06
  • The blsi variant seemed familiar, and on checking my notes it's near-identical to an earlier C version, which is how I came to be tweaking k registers in the first place. Alas it's slower in context, which I didn't give (perhaps I should have, since it's why I noticed at all): this is inlined in a loop and executed 2-3 trillion times with surrounding masked vector ops (both compilers emit the same instructions inlined as they do otherwise). Thanks for diving so deep; I didn't know about the uiCA tool, and it's going to be a new favourite bookmark for further analysis and experimentation. – inopinatus Jun 10 '23 at 05:04
  • Does your loop bottleneck on latency of this operation rather than throughput? I think clang's is worse for that. The only thing your context added is that the surrounding code is mostly vector uops. You didn't say anything about which ports they can run on, and masking doesn't limit that any narrower than for an unmasked version of the same uop. Your description could apply to a loop with one long dependency chain, like a masked dot product that wasn't unrolled with multiple accumulators, or to something with a short chain of independent work every iteration, with masking only inside that. – Peter Cordes Jun 10 '23 at 05:17
  • A fun one-line variant is `_mm512_mask_expand_epi64(src, _mm512_testn_epi64_mask(src, src), (__m512i){val})`, and it's so far the best I can get out of clang. – inopinatus Jun 10 '23 at 05:21
  • Interestingly (perhaps unsurprisingly) the bottleneck varies depending on how I re-arrange the computation, for which I have a few options; until now I didn't know which were the *good* options. Now not only do I have the luxury of rearranging the code, you've given me a new compass to help assess it beyond instrumenting it invasively. Thank you very much indeed! There's an applicable parable here, about teaching someone to fish. – inopinatus Jun 10 '23 at 05:49
  • Cheers, glad I could help. For more detail about what to look for (latency vs. back-end vs. front-end bottlenecks), see [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) and [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/q/692718) and https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/ – Peter Cordes Jun 10 '23 at 06:16

0 Answers