This is an MSVC missed optimization. GCC/clang can use `bzhi` directly on the output of `tzcnt` for your source. All compilers have missed optimizations in some cases, but GCC and clang tend to have fewer of them than MSVC.
(And GCC is careful to break the output dependency of `tzcnt` when tuning for Haswell, to avoid the risk of creating a loop-carried dependency chain through that false dependency. Unfortunately GCC still does this with `-march=skylake`, which doesn't have a false dep for `tzcnt`, only for `popcnt`. Ironically, GCC doesn't break the "true" dependency for `bsr`/`bsf` on any CPU.)
Intel documents the 2nd input to `_bzhi_u64` as `unsigned __int32 index`. (You're making this explicit with a `static_cast` to `uint32_t` for some reason, but removing the explicit cast doesn't help.) IDK how MSVC defines the intrinsic or handles it internally, or why it wants to do this; I wonder if it's zero-extension to 64-bit inside the internal logic of MSVC's `_bzhi_u64` intrinsic, which takes a 32-bit C input but uses a 64-bit asm register. (`tzcnt`'s output value-range is 0..64, so this zero-extension is a no-op in this case.)
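For reference, a minimal sketch of the `tzcnt` + `bzhi` form being discussed (the question's actual source isn't reproduced here, so the function name and exact shape are assumptions):

#include <immintrin.h>
#include <cstdint>

// Assumed shape (see note above): count the set bits of yyy below the lowest
// set bit of xxx.  bzhi(yyy, n) keeps bits [0, n) and zeros the rest.
uint64_t popcnt_low_bzhi(uint64_t xxx, uint64_t yyy) {
    uint64_t position = _tzcnt_u64(xxx);   // 64 if xxx == 0; bzhi then keeps all bits of yyy
    uint64_t masked = _bzhi_u64(yyy, static_cast<uint32_t>(position));  // explicit cast to the 32-bit index type
    return _mm_popcnt_u64(masked);
}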
Masked popcnt: shift `yyy` instead of masking it
As in What is the efficient way to count set bits at a position or lower?, it can be more efficient to just shift out the bits you don't want, instead of zeroing them in-place. (Although `bzhi` avoids the cost of creating a mask, so this is just break-even, modulo differences in which execution ports `bzhi` vs. `shlx` can run on.) `popcnt` doesn't care where the bits are.
(FIXME: the C++ and asm are using a right shift, which discards low bits. I should have used a left shift to shift out the high bits. When I wrote this, I was probably thinking of discarding low bits since `tzcnt` counts low zeros in the other input. Left and right shifts perform the same, so I'm going to leave the answer as-is for now.)
uint64_t popcnt_shift(uint64_t xxx, uint64_t yyy) {
    auto position = _tzcnt_u64(xxx);
    auto shifted = yyy >> position;
    return _mm_popcnt_u64(shifted);
}
MSVC on Godbolt
;; MSVC 19.24 -O2 -arch:AVX2 (to enable BMI2 for shrx)
;; also clang10.0 -O3 -march=haswell makes this asm
unsigned __int64 popcnt_shift(unsigned __int64,unsigned __int64) PROC
tzcnt rax, rcx
shrx rax, rdx, rax
popcnt rax, rax
ret 0
3 total uops for the front end = very good for overall throughput when mixed with other surrounding code.
Back-end bottlenecks: 2 uops for port 1 (`tzcnt` and `popcnt`) on Intel CPUs. (`shrx` runs on port 0 or port 6, as a single uop. Enabling AVX2, which apparently enables BMI2 for MSVC, is important; otherwise it will use a 3-uop `shr rax, cl`.)
Critical path latency:

- from `yyy` to result: 1 for SHRX, 3 for popcnt = 4 cycles
- from `xxx` to result: 3 for TZCNT plus the above = 7 cycles
Unfortunately GCC is over-cautious about breaking false dependencies, costing extra front-end bandwidth. (But no extra back-end cost)
# GCC10.1
xor eax, eax # could have just done tzcnt rdi,rdi
tzcnt rax, rdi
shrx rsi, rsi, rax
xor eax, eax # pointless: RAX was already part of the dep chain leading to this.
popcnt rax, rsi # GCC7.5 shifts into RAX for popcnt rax,rax to avoid this dep-breaking xor.
ret
Lower-latency alternatives without `tzcnt`
(But more uops, potentially worse front-end throughput. Back-end execution port pressure benefits depend on surrounding code.)
BMI1 has some bithack instructions to do stuff like isolate the lowest set bit, all 1 uop with single-cycle latency on Intel. (AMD Zen runs them as 2 uops, 2 cycle latency: uops.info)
`blsmsk` - Get Mask Up to (and including) Lowest Set Bit. Your original is not inclusive of the LSB in `xxx`, so unfortunately this mask isn't directly usable.
uint64_t zmask_blsmsk(uint64_t xxx, uint64_t yyy) {
    auto mask = _blsmsk_u64(xxx);     // bits up to and including the lowest set bit of xxx
    auto masked = yyy & ~(mask<<1);
    return masked;
}
;; MSVC -O2 -arch:AVX2 (to enable BMI for andn)
blsmsk rax, rcx
add rax, rax ; left shift
andn rax, rax, rdx ; (~stuff) & yyy
ret 0
Or `blsi` will Isolate the Lowest Set Bit. That `blsi(xxx) - 1` will create a mask up to and not including it. (For xxx=1, we'll get `1 - 1 = 0`: an empty mask, since there are no bits below bit 0.)
uint64_t zmask2(uint64_t xxx, uint64_t yyy) {
    auto setbit = _blsi_u64(xxx);
    auto masked = yyy & ~(setbit-1);  // yyy & -setbit
    return masked;
}
MSVC compiles as expected, same as clang:
blsi rax, rcx
dec rax
andn rax, rax, rdx
ret 0
GCC uses the 2's complement identity (`-setbit == ~(setbit - 1)`, since `-x == ~x + 1`) to transform it into this, using shorter instructions that can run on any port. (`andn` can only run on port 1 or port 5 on Haswell / Skylake.)
;; GCC7.5 -O3 -march=haswell. Later GCC wastes a `mov` instruction
blsi rax, rdi
neg rax
and rax, rsi
This is 3 uops (not including popcnt) but only has 3 cycle latency from `xxx` -> result, down from 4 for `tzcnt` / `shrx`. (All of these are not counting the 3 cycle popcnt latency.) And more importantly, it doesn't compete for port 1 with `popcnt`.
(The way MSVC compiles it, to `blsi` + `dec` + `andn`, is 2 uops for port 1 / port 5, though.)
The optimal choice will depend on the surrounding code, whether throughput or latency is the bottleneck.
If you're doing this for many different masks stored contiguously, SIMD could be effective. Avoiding `tzcnt` means you can do the lowest-set isolate or mask with bithacks that take a couple instructions. e.g. `blsi` is `(-SRC) bitwiseAND (SRC)`, as documented in the Operation section of Intel's asm manual. (Handy place to look up bithack expressions.) `blsmsk` is `(SRC-1) XOR (SRC)`.
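For example, a sketch (my own illustration, not from any particular codebase) of those per-64-bit-element bithacks with plain AVX2 sub/and/xor, since SIMD has no `blsi`/`blsmsk`:

#include <immintrin.h>

// Per-element equivalents of blsi and blsmsk, using the identities quoted
// above from the Operation sections.  (Illustration only; names are made up.)
__m256i blsi_epi64(__m256i v) {
    __m256i neg = _mm256_sub_epi64(_mm256_setzero_si256(), v);  // -x per element
    return _mm256_and_si256(v, neg);                            // x & -x: isolate lowest set bit
}

__m256i blsmsk_epi64(__m256i v) {
    __m256i vm1 = _mm256_sub_epi64(v, _mm256_set1_epi64x(1));   // x - 1 per element
    return _mm256_xor_si256(v, vm1);                            // x ^ (x - 1): mask up to and including lowest set bit
}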
SIMD popcnt can be done with `vpshufb` to do 4-bit parallel LUTs on the two halves of each byte, and you can use `vpsadbw` to accumulate horizontally into counts for each 64-bit element. (To emulate Ice Lake's AVX512 `vpopcntq`.)
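A minimal sketch of that technique with AVX2 intrinsics (my own illustration, not from the original answer; the function name is made up):

#include <immintrin.h>

// Popcount of each 64-bit element of an AVX2 vector, using the vpshufb
// nibble-LUT trick and vpsadbw for the horizontal byte sum.
__m256i popcnt_epi64_avx2(__m256i v) {
    const __m256i nib_mask = _mm256_set1_epi8(0x0f);
    // popcount LUT for every 4-bit value, repeated in both 128-bit lanes
    const __m256i lut = _mm256_setr_epi8(
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    __m256i lo  = _mm256_and_si256(v, nib_mask);                        // low nibble of each byte
    __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(v, 4), nib_mask);  // high nibble of each byte
    __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                  _mm256_shuffle_epi8(lut, hi));        // per-byte popcount
    return _mm256_sad_epu8(cnt, _mm256_setzero_si256());                // sum bytes within each qword
}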