Partial answer covering things other existing answers didn't, like why GCC apparently wastes a cltq, why xor-zeroing helps, and why GCC's code-gen with other options like -march=skylake or -march=sandybridge is not good.
The cltq (aka cdqe) is an unfortunate consequence of __builtin_ctzll() being defined as returning an int, not long long.
bsf %rdi, %rax either writes RAX with a number from 0..63, or leaves it unmodified. (On paper the result for input = 0 is an undefined value, but Intel CPUs in reality are compatible with the behaviour AMD documents: leave the output register unmodified if the input was 0 for bsf or bsr, unlike with tzcnt / lzcnt.)
__builtin_ctzll() is only allowed to return valid int values, even for input = 0. (GNU C specifies the builtin as "If x is 0, the result is undefined." Note that it says the result, not "the behaviour": it's not UB, so you're still guaranteed to get some 32-bit int value.)
When GCC doesn't know for sure that rep bsf will run as tzcnt rather than bsf, and you return uint64_t instead of unsigned or int, it has to cover the possibility that the destination held old garbage with high bits that aren't copies of bit 31. (Returning a narrower type leaves it to the caller to ignore high garbage.)
In this case, where it's also xor-zeroing the destination, that guarantees a 0 output for a 0 input, so no need to sign-extend. Unless you want to play it safe in case Intel (or some software x86 emulator) stops doing what AMD documents and actually produces something other than the old value on input=0.
IIRC, rep bsf %edi, %eax on Intel or AMD (I forget which) does always truncate RAX to 32-bit even if it leaves EAX unmodified, but on the other vendor it doesn't. So fun fact: if targeting only CPUs that do that bare-minimum zero-extension, (uint64_t)(uint32_t)__builtin_ctz(x) wouldn't need any extra work.
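The cast expression above can be wrapped up like this. This is only a sketch: ctz32_zext and ctz32_examples are names invented here, and whether the compiler actually avoids an extra instruction depends on the target and options, as discussed above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: int -> uint32_t -> uint64_t, so the widening is a
 * zero-extension rather than a sign-extension. */
uint64_t ctz32_zext(uint32_t x)
{
    return (uint64_t)(uint32_t)__builtin_ctz(x);   /* result undefined for x == 0 */
}

/* A few non-zero inputs, where the result is well defined. */
void ctz32_examples(void)
{
    assert(ctz32_zext(1) == 0);
    assert(ctz32_zext(12) == 2);
    assert(ctz32_zext(UINT32_C(0x80000000)) == 31);
}
```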
Output dependency for bsf always, tzcnt sometimes
GCC is kinda braindead sometimes: they carefully avoid the output dependency for tzcnt on generic or Intel, but code-gen seems oblivious of the true dependency of bsf if you compile with -march=sandybridge or something that doesn't support TZCNT. (In which case it just uses bsf, without xor-zeroing the destination.)
So apparently the xor-zeroing dep-breaking was only done in case it ran as TZCNT on Intel. Which is ironic because bsf always has an output dependency, on all CPUs, because in practice, Intel CPUs are compatible with what AMD documents: input = 0 leaves the output unmodified for bsf/bsr. So like cmov, they need the dst reg as an input.
https://godbolt.org/z/WcY3sManq shows that, and that GCC also doesn't know that -march=skylake fixed the false dep for tz/lzcnt. (Why does breaking the "output dependency" of LZCNT matter?)
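To make the "needs the dst reg as an input" point concrete, here is a small C model of the bsf behaviour AMD documents (and Intel CPUs follow in practice, per the above). It is a model of the semantics only, not how you'd get the instruction emitted; bsf_model is a name invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of 64-bit bsf as AMD documents it: the result depends
 * on the old destination value as well as the source, like cmov does. */
uint64_t bsf_model(uint64_t old_dst, uint64_t src)
{
    return src ? (uint64_t)__builtin_ctzll(src) : old_dst;
}

void bsf_model_examples(void)
{
    /* Non-zero source: old destination is ignored. */
    assert(bsf_model(0xdead, 8) == 3);
    /* Zero source: old destination passes through unmodified. */
    assert(bsf_model(0xdead, 0) == 0xdead);
    /* Destination xor-zeroed first: zero source gives a 0 result. */
    assert(bsf_model(0, 0) == 0);
}
```

Because old_dst is a real input in this model, the dependency on the destination register is a true one for bsf, whereas for tzcnt (which always writes its output) it is only a false dependency on some CPUs.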