Different codegen for 'i % 2 == 0' between clang and GCC, why is it compiled this way?

Question

I was curious whether the compilers do the obvious optimization for code like N % <some factor of 2> equals / not equals 0. Indeed they do, but there are some interesting nuances, so here are two questions:

Why do GCC and MSVC produce seemingly more complex output than clang for the case of i % 2 (https://godbolt.org/z/KaWrYoz1a), but equal - simpler - output for the factors of 4, 8, 16?

clang output (expected):

test    dil, 1
sete    al

GCC output:

mov     rax, rdi
not     rax
and     eax, 1

MSVC is the same, but different order between mov / not.

Why does MSVC produce much more complex code for signed integers (both 32 and 64-bit) for factors of 4 / 8 (https://godbolt.org/z/xP1qcch8q)? Clang and GCC are unaffected by signedness.

mov     rax, rcx
cdq
and     edx, 3
add     rax, rdx
and     eax, 3
cmp     rax, rdx
sete    al

And why does MSVC still produce the simple code for the factor of 2?

P. S. The compilers are at their maximum O level, but I am not specifying any architecture (-march) flags.

The `clang` output should be preceeded by `xor eax, eax` to avoid the false dependency on the `sete` instruction. — fuz, May 29 '22 at 20:41
@fuz: very interesting point! Is it a compiler bug / missed optimization? Does it mean the GCC's variant is more correct because it avoids that problem? — Violet Giraffe, May 29 '22 at 20:43
I was curious as to what Intel C++ Compiler Classic would generate... and it's different from the other 3. Curiouser and curiouser. — Eljay, May 29 '22 at 21:05
@VioletGiraffe I suspect such an instruction was issued above the code you cited. Can you provide a [mcve]? — fuz, May 29 '22 at 21:17
@fuz - I have provided the example, there is a Godbolt link in each of the two questions. Only the Intel compiler issues a XOR, but not for all factors. — Violet Giraffe, May 29 '22 at 21:22
@fuz: Clang is normally reckless about false dependencies except in loops within a single function (sometimes even using partial registers for no reason). This is normally not a problem but it can sometimes bite it: [Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?](https://stackoverflow.com/q/60688348). (VioletGiraffe - it's not a *correctness* problem, "only" performance). Clang takes a risk by creating a false dependency, with the benefit being fewer front-end uops. — Peter Cordes, May 30 '22 at 01:27
All these compilers are missing the 2-instruction way that avoids false deps or partial register stalls: `lea eax, [rdi+1]` / `and eax, 1`. The low bit of EAX is opposite that of the EDI input. The `sum` output of a half-adder is just XOR of the inputs. (XOR is add-without carry, add is xor + carry that makes higher bits different, but the low bit is the same. Or more specifically, the all bits are the same between xor and add up to the point where the inputs both have a set bit.) See also [Check if a number is even](https://stackoverflow.com/q/64112337) — Peter Cordes, May 30 '22 at 01:31
(I should have filed missed-optimization bugs with GCC and clang when answering that question a couple years ago. It's pretty normal to see imperfect optimizations. And with MSVC, not rare to see braindead asm like this. Compilers in general aren't truly intelligent, they rely on matching patterns and stuff like that.) — Peter Cordes, May 30 '22 at 01:53
@GoswinvonBrederlow: You're thinking of C++03 (and C89). Since C++11 and C99, `/` and `%` rounding semantics have been standardized to round towards 0, with `(a/b)*b + a%b` equal to `a` (matching what CPUs and older mainstream implementations already did). — Peter Cordes, May 30 '22 at 03:10
@PeterCordes The codegen for msvc looks to me like it's compensating for that while gcc/clang just check the lower bits. — Goswin von Brederlow, May 30 '22 at 08:12
Yes, MSVC is clearly failing to optimize away work based on the fact that we're comparing against zero. The sign of the result doesn't matter, but agreed, it looks like MSVC is doing it anyway. It's widely known to be not as good at optimizing and GCC and Clang. e.g. https://www.agner.org/optimize/blog/read.php?i=1015 — Peter Cordes, May 30 '22 at 08:14

Different codegen for 'i % 2 == 0' between clang and GCC, why is it compiled this way?

0 Answers0