I've noticed several instances of clang disregarding the instructions that masked AVX-512 intrinsics are documented to map to and substituting slower instruction sequences. This undermines the expectation of programmer control; otherwise, why bother using intrinsics?
Here's an egregious example I've encountered (godbolt) which led to a 3x slowdown with clang's output compared to gcc. Expecting this:
avx512_low_insert:
vptestnmq %zmm0, %zmm0, %k0
movl $1, %eax
kmovb %eax, %k2
knotb %k0, %k1
kaddb %k2, %k1, %k1
kandb %k1, %k0, %k1
vpbroadcastq %rdi, %zmm0 {%k1}
we instead obtain (with clang 16.x, current release at time of writing) the much more expensive:
avx512_low_insert:
vptestmq %zmm0, %zmm0, %k0
movb $1, %al
kmovd %eax, %k1
kaddb %k1, %k0, %k1
vptestnmq %zmm0, %zmm0, %k1 {%k1}
vpbroadcastq %rdi, %zmm0 {%k1}
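For anyone comparing the two listings, here is a minimal scalar sketch of the mask arithmetic each one performs, assuming I have read the assembly correctly. The function names and the `zero_mask` parameter are hypothetical and only model the 8-bit mask registers; they are not part of the original source.

```c
#include <stdint.h>

/* Scalar model of the expected sequence (vptestnmq, knotb, kaddb, kandb).
   zero_mask has bit i set when 64-bit lane i of the vector is zero. */
static uint8_t low_mask_expected(uint8_t zero_mask)
{
    uint8_t k1 = (uint8_t)~zero_mask;   /* knotb */
    k1 = (uint8_t)(k1 + 1u);            /* kaddb with the constant 1 (wraps at 8 bits) */
    return (uint8_t)(k1 & zero_mask);   /* kandb: keeps only the lowest set bit */
}

/* Scalar model of clang's sequence (vptestmq, kaddb, masked vptestnmq). */
static uint8_t low_mask_clang(uint8_t zero_mask)
{
    uint8_t nonzero_mask = (uint8_t)~zero_mask;  /* vptestmq: lanes that are nonzero */
    uint8_t k1 = (uint8_t)(nonzero_mask + 1u);   /* kaddb with the constant 1 */
    return (uint8_t)(zero_mask & k1);            /* vptestnmq under write mask k1 */
}
```

Both functions return the same mask for every input, selecting the lowest zero lane, so the model only shows that the results agree. It says nothing about cost, which is where the two listings differ.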
Clang is essentially disregarding the intrinsics specified and substituting its own, inferior, ideas.
Short of hand-rolling inline asm, is there any way I can persuade it otherwise?