TL;DR: use a `uint64_t` shift to implement this efficiently with `uint32_t` when compiling for 64-bit machines that have `lzcnt` (AMD since K10, Intel since Haswell). Without `lzcnt` (only `bsr`, which is baseline for x86), the `n==0` case is still special.
For the `uint64_t` version, the hard part is that there are 65 different possible positions for the highest set bit, including non-existent (`lzcnt` producing 64 when all bits are zero). But a single shift with 64-bit operand-size on x86 can only produce one of 64 different values (assuming a constant input), since x86 shifts mask the count like `foo >> (c&63)`. Using a shift requires special-casing one leading-bit-position, typically the `n==0` case. As Harold's answer shows, BMI2 `bzhi` avoids that, allowing bit counts from 0..64.
The same applies to 32-bit operand-size shifts: they mask the count like `c&31`. But to generate a mask for `uint32_t`, we can use a 64-bit shift efficiently on x86-64. (Or a 32-bit shift for `uint16_t` and `uint8_t`. Fun fact: x86 asm shifts with 8 or 16-bit operand-size still mask their count mod 32, so they can shift out all the bits without even using a wider operand-size. But 32-bit operand-size is efficient; no need to mess with partial-register writes.) This strategy is even more efficient than `bzhi` for a type narrower than register width.
```c
// optimized for 64-bit mode; otherwise 32-bit bzhi or a cmov version of Paul R's is good
#include <stdint.h>

#ifdef __LZCNT__
#include <immintrin.h>

uint32_t flip_32_on_64(uint32_t n)
{
    uint64_t mask32 = 0xffffffff;   // (uint64_t)(uint32_t)-1
    // This needs to be _lzcnt_u32, not __builtin_clz: we need 32 for n==0.
    // If lzcnt isn't available, we can't avoid handling n==0 specially.
    uint32_t mask = mask32 >> _lzcnt_u32(n);
    return n ^ mask;
}
#endif
```
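As a quick sanity check (my addition, not part of the original answer) of what the function computes: every bit from the highest set bit downward gets flipped, and 0 maps to 0. Build with lzcnt enabled (e.g. `-march=haswell` or `-mlzcnt`) so `__LZCNT__` is defined.

```c
#include <assert.h>

int main(void)
{
    assert(flip_32_on_64(0) == 0);                      // lzcnt(0) == 32; the 64-bit shift makes the mask 0
    assert(flip_32_on_64(0x32u) == 0x0du);              // 0b110010 -> 0b001101
    assert(flip_32_on_64(0x80000000u) == 0x7fffffffu);  // top bit set: the whole value flips
    return 0;
}
```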
This works equivalently for `uint8_t` and `uint16_t` (literally the same code with the same mask, using a 32-bit `lzcnt` on them after zero-extension), but not for `uint64_t`. (You could use an `unsigned __int128` shift, but `shrd` masks its shift count mod 64, so compilers still need some conditional behaviour to emulate it. So you might as well do a manual cmov or something, or `sbb same,same` to generate a `0` or `-1` in a register as the mask to be shifted.)
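For illustration, here's a portable sketch of that "manual cmov" idea for `uint64_t` (the function name is hypothetical and this is my addition, not code from the question or other answers). The ternary keeps the shift count in 0..63, avoiding both C shift-count UB and the impossible 65th mask value; compilers will often turn it into `cmov`, though that isn't guaranteed.

```c
#include <stdint.h>

uint64_t flip_64_cmov(uint64_t n)   // hypothetical name
{
    // __builtin_clzll is only evaluated when n != 0, so it's well-defined
    // and returns 0..63; the n == 0 case selects a zero mask directly.
    uint64_t mask = n ? (~0ULL >> __builtin_clzll(n)) : 0;
    return n ^ mask;
}
```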
Godbolt with gcc and clang. Note that it's not safe to replace `_lzcnt_u32` with `__builtin_clz`; clang 11 and later assume that it can't produce 32 even when they compile it to an `lzcnt` instruction¹, and optimize the shift operand-size down to 32, which will act as `mask32 >> (clz(n) & 31)`.
```asm
# clang 14 -O3 -march=haswell  (or znver1 or bdver4 or other BMI2 CPUs)
flip_32_on_64:
        lzcnt   eax, edi        # Skylake fixed the output false-dependency for lzcnt/tzcnt, but not popcnt.
                                # Clang doesn't care; it's reckless about false deps except inside a loop in a single function.
        mov     ecx, 4294967295
        shrx    rax, rcx, rax
        xor     eax, edi
        ret
```
Without BMI2, e.g. with `-march=bdver1` or `barcelona` (aka K10), we get the same code-gen except with `shr rax, cl`. Those CPUs do still have `lzcnt`; otherwise this wouldn't compile.
(I'm curious whether Intel Skylake-family Pentium/Celeron run `lzcnt` as `lzcnt` or `bsr`. They lack BMI1/BMI2, but `lzcnt` has its own feature flag. It seems low-power uarches as recent as Tremont are missing `lzcnt`, though, according to InstLatx64 for a Pentium Silver N6005 Jasper Lake-D, Tremont core. I didn't manually look for the feature bit in the raw CPUID dumps of recent Pentium/Celeron, but InstLatx64 does have those available if someone wants to check.)
Anyway, `bzhi` also requires BMI2, so if you're comparing against that for any size but `uint64_t`, this is the comparison.
This `shrx` version can keep its `-1` constant around in a register across loops, so the `mov reg, -1` can be hoisted out of a loop after inlining, if the compiler has a spare register. The best `bzhi` strategy doesn't need a mask constant, so it has nothing to gain. `_bzhi_u64(~x, 64 - _lzcnt_u64(x))` is 5 uops, but works for 64-bit integers on 64-bit machines. Its critical-path latency is the same as this version's (lzcnt / sub / bzhi).
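Spelled out as a compilable function (a sketch with a name I made up, assuming BMI2 and lzcnt, e.g. `-march=haswell`):

```c
#include <stdint.h>
#include <immintrin.h>

uint64_t flip_64_bzhi(uint64_t x)
{
    // 64 - lzcnt(x) is the number of significant bits in x (0 when x == 0).
    // bzhi keeps exactly that many low bits of ~x, so everything at and
    // below the highest set bit gets flipped, and x == 0 maps to 0.
    return _bzhi_u64(~x, 64 - _lzcnt_u64(x));
}
```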
Without LZCNT, one option might be to always flip as a way to get FLAGS set for CMOV, and use `-1 << bsr(n)` to XOR some of them back to the original state. This could reduce critical-path latency. IDK if a C compiler could be coaxed into emitting this, especially not if you want to take advantage of the fact that real CPUs keep the BSR destination unchanged if the source was zero; only AMD documents this fact (Intel says it's an "undefined" result). (TODO: finish this hand-written asm idea.)
Other C ideas for the `uint64_t` case: `cmov` or `cmp`/`sbb` (to generate a `0` or `-1`) in parallel with `lzcnt` to shorten the critical-path latency? See the Godbolt link where I was playing with that.
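For example (my sketch, not the code from that Godbolt link; the function name is made up): generate the all-ones-or-zero value with a compare that doesn't depend on the `lzcnt` result, then shift it.

```c
#include <stdint.h>
#include <immintrin.h>

uint64_t flip_64_parallel(uint64_t n)   // needs lzcnt (-mlzcnt or -march=haswell etc.)
{
    uint64_t all_or_nothing = -(uint64_t)(n != 0);   // 0 for n == 0, else all-ones; independent of lzcnt
    // & 63 keeps the C shift well-defined when lzcnt returns 64 (n == 0);
    // it should cost nothing in the asm since x86 masks 64-bit shift counts anyway.
    return n ^ (all_or_nothing >> (_lzcnt_u64(n) & 63));
}
```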
ARM/AArch64 saturate their shift counts, unlike how x86 masks for scalar. If one could take advantage of that safely (without C shift-count UB) that would be neat, allowing something about as good as this.
x86 SIMD shifts also saturate their counts, which Paul R took advantage of with an AVX-512 answer using `vplzcntd` and variable-shift. (It's not worth copying data to an XMM reg and back for one scalar shift, though; that's only useful if you have multiple elements to do.)
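For completeness, a rough sketch of that SIMD approach for 16 `uint32_t` elements at a time (AVX-512F + AVX-512CD). This is my illustration in the same spirit, not Paul R's exact code:

```c
#include <immintrin.h>

__m512i flip_below_msb_epi32(__m512i v)   // hypothetical name
{
    __m512i lz   = _mm512_lzcnt_epi32(v);                        // 32 for all-zero elements
    // Variable right shifts produce 0 for counts >= 32, so zero elements stay zero.
    __m512i mask = _mm512_srlv_epi32(_mm512_set1_epi32(-1), lz);
    return _mm512_xor_si512(v, mask);
}
```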
Footnote 1: clang codegen with `__builtin_clz` or `...ll`

Using `__builtin_clzll(n)` will get clang to use 64-bit operand-size for the shift, since values from 32 to 63 become possible. But you can't actually use that to compile for CPUs without `lzcnt`: the `63-bsr(n)` a compiler would use without `lzcnt` available can't produce the 64 we need for the `n==0` case. Not unless you did `n <<= 1;` / `n |= 1;` or something before the `bsr` and adjusted the result, but that would be slower than `cmov`.
If you were using a 64-bit `lzcnt`, you'd want `uint64_t mask = -1ULL`, since there will be 32 extra leading zeros after zero-extending to `uint64_t`. Fortunately all-ones is relatively cheap to materialize on all ISAs, so use that instead of `0xffffffff00000000ULL`.
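To make that concrete, this is roughly the variant the footnote is describing (a sketch with a made-up name; note it is NOT a drop-in replacement for the `lzcnt` version above, because `n==0` is undefined here):

```c
#include <stdint.h>

uint32_t flip_32_clzll(uint32_t n)   // broken for n == 0: clz of zero is UB
{
    // clzll on the zero-extended value counts 32 extra leading zeros (range 32..63
    // for nonzero n), hence the all-ones 64-bit mask instead of 0xffffffff.
    uint64_t mask = ~0ULL >> __builtin_clzll(n);
    return n ^ (uint32_t)mask;
}
```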