While watching this talk by Matt Godbolt, I was astonished to see that Clang, if instructed to compile for the Haswell¹ architecture, works out that the following code
int foo(int a) {
    int count = 0;
    while (a) {
        ++count;
        a &= a - 1;
    }
    return count;
}
is for counting the set bits in an int (I don't know how long I'd have needed to work that out myself), so it just uses a single popcnt instruction:
foo(int):                               # @foo(int)
        popcntl %edi, %eax
        retq
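For anyone else who does not see at a glance why the loop counts set bits: each `a &= a - 1` clears the lowest set bit of `a`, so the loop body runs once per set bit. Below is a minimal sketch for checking that against the compiler's own population count. The names `count_set_bits` and `check_against_builtin` are made up for illustration, the sketch uses `unsigned` instead of `int` so that `a - 1` cannot overflow, and it assumes GCC or Clang, since `__builtin_popcount` is a compiler builtin rather than standard C.

```c
#include <assert.h>

/* Hypothetical helper: the same loop as foo, but on unsigned so that
   a - 1 is well defined for every input (an assumption of this sketch). */
static int count_set_bits(unsigned a) {
    int count = 0;
    while (a) {
        ++count;
        a &= a - 1;   /* clears the lowest set bit of a */
    }
    return count;
}

/* Hypothetical helper: compares the loop with the GCC/Clang builtin. */
static void check_against_builtin(unsigned a) {
    assert(count_set_bits(a) == __builtin_popcount(a));
}
```

Calling `check_against_builtin` on a handful of values (0, 12, all ones) from a small `main` should pass silently if the loop really is a population count.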
I wanted to try this myself, but I found that the generated code is
foo(int): # @foo(int)
popcntl %edi, %eax
cmovel %edi, %eax
retq
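My reading of these two instructions, which may well be missing something: popcnt sets the zero flag when its source operand is zero, and cmove copies its source only when the zero flag is set. If that is right, the sequence can be modeled in C roughly as below. The names `model_old_codegen` and `model_new_codegen` are invented for this sketch, and `__builtin_popcount` (a GCC/Clang builtin) stands in for the instruction.

```c
/* Hypothetical model of the Clang 10.0.1 output: just popcnt. */
static int model_old_codegen(unsigned a) {
    return __builtin_popcount(a);
}

/* Hypothetical model of the Clang 11.0.0 output: popcnt, then cmove. */
static int model_new_codegen(unsigned a) {
    int result = __builtin_popcount(a);  /* popcntl %edi, %eax */
    if (a == 0) {                        /* ZF is set only for a zero source */
        result = (int)a;                 /* cmovel %edi, %eax (copies the zero) */
    }
    return result;
}
```

Under this model both functions return the same value for every input, because the conditional copy only ever writes a zero over a zero, which is exactly why the extra instruction puzzles me.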
It turns out that the generated code changed between Clang 10.0.1 and Clang 11.0.0.
Why is the newer Clang emitting one more instruction that wasn't needed before? The code is so simple that I can't understand how one more instruction can do anything other than make the code slower (even if only by a very small amount, I don't know).
¹ As a side question, does the fact that omitting the -march=haswell option results in much longer, more human-like code simply mean that the physical CPUs targeted by that option have circuitry for counting set bits, while the others (well, whatever Clang defaults to) don't?