Why newer clang is generating one more instruction than just popcntl to count the bits of an int on haswell architecture?

Question

While watching this talk by Matt Godbolt, I was astonished to see that Clang, if instructed to compile for the Haswell¹ architecture, works out that the following code

int foo(int a) {
    int count = 0;
    while (a) {
        ++count;
        a &= a - 1;
    }
    return count;
}

is for counting the set bits in an int (I don't how long I'd have needed to work that out myself), so it just uses that instruction:

foo(int):                                # @foo(int)
        popcntl %edi, %eax
        retq

And I wanted to try myself, but I found that the generated code is

foo(int):                                # @foo(int)
        popcntl %edi, %eax
        cmovel  %edi, %eax
        retq

It turns out that generated code changed across Clang 10.0.1 and Clang 11.0.0.

Why is the newer Clang emitting one more instruction that wasn't needed before? The code is so simple that I can't understand how one more instruction can do anything else than making the code slower (even if by an amount which might be very small, I don't know).

¹ As a side question, does that fact that not specifying the -march=haswell option results in a much longer, more human-like code, simply mean that the physical CPUs targeted by that option have circuitry for doing the set bit count and others (well, whatever clang defaults to) don't?

`__builtin_popcount(a)` still produces only one instruction. So I assume it's just the optimizer screwing up, not a conscious decision. Note that there is a false dependency on the output register in some Intel CPUs that [Clang does not work around](https://github.com/llvm/llvm-project/issues/33216) but GCC does. The `cmov` isn't really helpful in this regard, either, I think. — Homer512, Aug 28 '23 at 08:09
clang 16: `cmove`. clang 15: no `cmove`. clang 14: no `cmove`. clang 13: no `cmove`. clang 12: `cmove`. clang 11: `cmove`. clang 10 and below: no `cmove`. Consider submitting this to OEIS. — n. m. could be an AI, Aug 28 '23 at 09:05
Possibly an (incorrect??) attempt to circumvent [this reported Haswell/POPCNT bug](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011)? But I would expect the `cmove` to be *before* the `popcnt`, if that's the issue. — Adrian Mole, Aug 28 '23 at 09:24
Maybe it's a compiler bug. Code like this is required for `bsf` and `bsr` to get consistent results when the input is zero, perhaps someone copy-pasted the wrong code generation rule. — fuz, Aug 28 '23 at 09:52
Please don't tag questions as C when you are actually using a C++ compiler. — Lundin, Aug 28 '23 at 13:37
@AdrianMole That's the gcc bugzilla however, not for clang. But it seems there's a silicon errata for the CPU. gcc from version 9.0 generates `xor eax, eax`, `popcnt eax, edi` likely because of the silicon bug. Before version 9.0, `gcc -march=haswell -O3` doesn't seem to optimize the code into using `popcnt` at all. — Lundin, Aug 28 '23 at 13:48
@Lundin Yeah, I know the site I linked is for GCC but maybe the clang folks tried to (later) address the same issue. Just guessing, really ... and I'm sure the clang people would know what order in which to put those two instructions. — Adrian Mole, Aug 28 '23 at 13:50
@AdrianMole: `cmove` reads FLAGS from `popcnt` so wouldn't make sense before, and also would still have EAX as an input dependency, which is what compilers should be trying to avoid. That's why GCC does `xor eax,eax` / `popcnt eax, edi` unless you use `-mtune=znver2` or something. Clang is still doing `cmove` with `-march=znver2`, so hopefully it's a different missed-optimization / regression, not related to the false output dependency of lzcnt/tzcnt/popcnt on Sandybridge family that was fixed in Skylake for lz/tzcnt, and fixed in Ice Lake for popcnt. — Peter Cordes, Aug 28 '23 at 18:28
@Lundin: GCC since 4.9 works around the false output dependency for popcnt/lzcnt/tzcnt (but not the true output dependency for `bsr` / `bsf` ...) with xor-zeroing for the destination or doing it in-place if it can use the result there. https://godbolt.org/z/qj66Ednne . GCC only recognizes that idiom (where the loop trip-count is the popcount) since 9.0, so I had to use `__builtin_popcount` to see the code-gen. Clang also avoids the false dependency in loops inside functions, but is [aggressive / reckless about false dependencies across functions](https://stackoverflow.com/q/70856597). — Peter Cordes, Aug 28 '23 at 18:34
The key point of `-march=haswell` is `-mpopcnt`, telling the compiler the target supports the popcnt instruction. Without that, it chooses to keep the loop as written, not re-implement it with the SWAR bithack it uses for `__builtin_popcount`. But even with tuning for totally different ISAs like `-march=znver1` (Zen 1) we still get this useless extra instruction from clang. https://godbolt.org/z/hTG5v5o18 — Peter Cordes, Aug 28 '23 at 18:39
You can report this regression as a missed-optimization bug on LLVM's issue tracker. Or don't because someone already did in April: https://github.com/llvm/llvm-project/issues/62450 . Apparently it was fixed in Clang 15 (or just got lucky), but broken again in Clang 16. — Peter Cordes, Aug 28 '23 at 18:42
For C++, prefer C++20 `std::popcount` from the `` header, or assign to a `std::bitset<32>` to portably get access to a target-optimized `popcnt` from the standard library. It's only in C where you're really stuck with idiom recognition if you want to avoid `#ifdef __GNUC__`. [Count the number of set bits in a 32-bit integer](https://stackoverflow.com/q/109023) This clang bug is specific to idiom-recognition; `__builtin_popcount` still compiles without an extra instruction. — Peter Cordes, Aug 28 '23 at 19:08
Unrelated bug: `a` needs to be declared `unsigned`, else for negative inputs, signed overflow occurs and causes UB. — Nate Eldredge, Aug 29 '23 at 13:33

Why newer clang is generating one more instruction than just popcntl to count the bits of an int on haswell architecture?

0 Answers0

Linked