For modern x86 with BMI1, you'd want to use 32 - popcnt(x) to count all the zeros in the whole register, and subtract lzcnt(x), the number of leading zeros, i.e. the zeros that aren't part of the significant digits of your binary integer.
In GNU C, that looks like
#include <limits.h>   // for CHAR_BIT

int count_significant_zeros(unsigned x)
{
    if (!x)
        return 0;   // __builtin_clz has an undefined result for input 0

    return sizeof(x)*CHAR_BIT - __builtin_popcount(x) - __builtin_clz(x);
    // assumes no padding bits; sizeof(x)*CHAR_BIT will be 32 when compiling for x86.
    // Use __builtin_popcountll and __builtin_clzll for 64-bit integers.
}
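For example (with an illustrative value of my own, not from the original question):

// 0x85 is 0b1000'0101: eight significant digits, three 1s, so five 0s.
// 32 - popcount(0x85) = 29 zero bits in the register; clz(0x85) = 24 leading zeros; 29 - 24 = 5.
int n = count_significant_zeros(0x85);   // n == 5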
Without -mpopcnt or SSE4.2, GCC and clang (Godbolt) use a SWAR bithack for __builtin_popcount, and bsr(x) ^ 31 as an optimization for 31 - bsr(x) to get the leading-zero count. Clang actually uses popcount(~x) as an optimization for 32 - popcount(x), using a not instruction before the bithack. It probably would have been at least as efficient to simplify the whole expression, 32 - popcnt(x) - (31 - bsr(x)), down to 1 + bsr(x) - popcnt(x), skipping the not.
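If you're curious what that SWAR bithack looks like, here's a sketch in C of the classic 32-bit version; the helper that compilers actually inline may differ in constants and ordering:

// Classic SWAR (SIMD-within-a-register) popcount for 32-bit values:
// sum bits in 2-bit, then 4-bit, then 8-bit groups, then add the four bytes.
static unsigned popcount32_swar(unsigned x)
{
    x -= (x >> 1) & 0x55555555;                      // per-2-bit counts
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);  // per-4-bit counts
    x = (x + (x >> 4)) & 0x0F0F0F0F;                 // per-byte counts
    return (x * 0x01010101) >> 24;                   // sum the bytes into the top byte
}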
bsr leaves the destination register unmodified for an input of 0, so you need to special-case that. (AMD documents this behaviour; Intel documents the result as "undefined", unlike lzcnt, which is well-defined to produce 32 for an input of 0.) Being able to use bsr without extra checks is why GNU C defines __builtin_clz to have an undefined result in that case (https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html).
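If you do need a zero-input-safe leading-zero count in GNU C (pre-C++20), one common way is to branch on zero yourself; the function name here is my own:

static int clz32_total(unsigned x)
{
    return x ? __builtin_clz(x) : 32;   // __builtin_clz(0) would be an undefined result
}

With lzcnt available (e.g. -march=haswell), compilers can often compile that ternary down to a single lzcnt with no branch.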
A C++20 version looks like the following, using std::countl_zero to count zeros on the left (leading). It produces the type width for an input of 0, so we don't need to check for that special case ourselves; this also lets the compiler make better asm when an instruction like lzcnt is available that doesn't need to special-case 0.
#include <bit>
#include <limits>

int count_significant_zeros_cpp20(unsigned x)
{
    return std::numeric_limits<decltype(x)>::digits - std::popcount(x) - std::countl_zero(x);
}
From clang 16 with -O2 -march=haswell for x86-64:

count_significant_zeros_cpp20(unsigned int):
        popcnt  eax, edi    # missed optimization: false dependency could be avoided with popcnt edi,edi
        lzcnt   ecx, edi    # Haswell also has a false output dependency on lzcnt, although Skylake doesn't.
        add     ecx, eax
        mov     eax, 32
        sub     eax, ecx
        ret
If you wanted to use inline asm for this, you might do it this way (Intel-syntax asm statements, so see the note about -masm=intel at the end):
int count_significant_zeros_asm(unsigned x)
{
    int retval, dummy;
    asm(
        "lzcnt  %[dummy], %[x] \n\t"    // false dependency on Intel before Skylake; could xor-zero [dummy] first to break it
        "popcnt %[x], %[x] \n\t"        // in-place to avoid false dependencies on Intel before Ice Lake
        "mov    %[ret], 32 \n\t"
        "sub    %[ret], %[dummy] \n\t"  // latency of this can overlap with popcnt latency
        "sub    %[ret], %[x] \n\t"
        : [ret]"=r"(retval),    // let the compiler pick registers
          [dummy]"=&r"(dummy),  // early-clobber output: written before we read x for the last time.
                                // (Although [x] is an RMW inout operand, so I don't think it could overlap anyway.)
          [x]"+r"(x)            // modify the input reg in place
        :   // no pure inputs
        :   // no clobbers
    );
    return retval;
    // dummy also holds lzcnt(x), in case you want it.
}
If you wanted to loop over bits, shifting right with shr makes a lot more sense; you can stop looping as soon as the value becomes 0, as in ikegami's rewrite of the community-wiki answer. (The community-wiki answer was originally just a std::popcount() counting ones, not even counting down from 32 to count zeros, which you could do with sbb reg, 0.)
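In C, that kind of right-shift loop might look something like this (my own sketch, not ikegami's exact code):

// Count the zeros below the leading 1 by shifting right until no set bits remain,
// like looping with shr and stopping when the value hits 0.
static int count_significant_zeros_loop(unsigned x)
{
    int zeros = 0;
    while (x) {
        zeros += 1 - (x & 1);   // +1 for a 0 bit, +0 for a 1 bit
        x >>= 1;
    }
    return zeros;               // 0 for x == 0, matching the versions above
}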
You're shifting left with add
, which means you have to loop through the leading zeros in a fixed-width register before getting to the leading 1
of the integer.
BTW, clang 14 and later support -masm=intel for Intel syntax in asm("") statements; before that, the option only affected the output syntax, not how clang's built-in assembler parsed asm statements.