4

I have a bit position (it's never zero), calculated using tzcnt, and I would like to zero the high bits starting from that position. This is the C++ code and the disassembly (I'm using MSVC):

auto position = _tzcnt_u64(xxx);
auto masked = _bzhi_u64(yyy, static_cast<uint32_t>(position));

tzcnt       rcx,rdx  
mov         ecx,ecx  
bzhi        rax,rbx,rcx 

BZHI accepts an unsigned int as its second parameter, but it only uses bits [7..0] from rcx, so this 'mov' instruction is unnecessary in my opinion.

I use this later to calculate a popcount, so I could also use something like << (64 - position) instead.
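
Roughly this, just as a sketch (position is never zero here, so the shift count 64 - position stays below 64 and there is no undefined behaviour):

auto position = _tzcnt_u64(xxx);
auto shifted = yyy << (64 - position);   // keeps the bits below position, same popcount as the bzhi result
auto count = _mm_popcnt_u64(shifted);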

The problem is that these two versions have the same execution time, even though bzhi should perform faster than sub+shlx, so the mov probably makes the difference.

Is there a way to avoid it, or is this a compiler thing?

Marka

2 Answers

6

This is an MSVC missed optimization. GCC/clang can use bzhi directly on the output of tzcnt for your source. All compilers have missed optimizations in some cases, but GCC and clang tend to have fewer cases than MSVC.
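
For example, clang compiles the original pair of intrinsics to roughly this (a sketch of the expected output; System V calling convention, so xxx arrives in rdi and yyy in rsi):

;; clang -O3 -march=haswell  (approximate output for _bzhi_u64(yyy, _tzcnt_u64(xxx)))
        tzcnt   rax, rdi
        bzhi    rax, rsi, rax
        ret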

(And GCC is careful to break the output dependency of tzcnt when tuning for Haswell to avoid the risk of creating a loop-carried dependency chain through that false dependency. Unfortunately GCC still does this with -march=skylake which doesn't have a false dep for tzcnt, only popcnt. Ironically, GCC doesn't break the "true" dependency for bsr/bsf on any CPU.)

Intel documents the 2nd input to _bzhi_u64 as unsigned __int32 index. (You're making this explicit with a static_cast to uint32_t for some reason, but removing the explicit cast doesn't help.) IDK how MSVC defines the intrinsic or handles it internally.

IDK why MSVC wants to do this; I wonder if it's zero-extension to 64-bit inside the internal logic of MSVC's _bzhi_u64 intrinsic that takes a 32-bit C input but uses a 64-bit asm register. (tzcnt's output value-range is 0..64 so this zero-extension is a no-op in this case)


Masked popcnt: shift yyy instead of masking it

As in What is the efficient way to count set bits at a position or lower?, it can be more efficient to just shift out the bits you don't want, instead of zeroing them in-place. (Although bzhi avoids the cost of creating a mask so this is just break-even, modulo differences in which execution ports bzhi vs. shlx can run on.) popcnt doesn't care where the bits are.

(FIXME: the C++ and asm are using a right shift, which discards low bits. I should have used left shift to shift out high bits. When I wrote this, I was probably thinking discarding low bits since tzcnt counts low zeros in the other input. Left and right shift perform the same so I'm going to leave the answer as-is for now.)

uint64_t popcnt_shift(uint64_t xxx, uint64_t yyy) {
    auto position = _tzcnt_u64(xxx); 
    auto shifted = yyy >> position;
    return _mm_popcnt_u64(shifted);
}

MSVC on Godbolt

;; MSVC 19.24 -O2 -arch:AVX2  (to enable BMI2 for shrx)
;; also clang10.0 -O3 -march=haswell  makes this asm
unsigned __int64 popcnt_shift(unsigned __int64,unsigned __int64) PROC
        tzcnt   rax, rcx
        shrx    rax, rdx, rax
        popcnt  rax, rax
        ret     0

3 total uops for the front end = very good for overall throughput when mixed with other surrounding code.

Back-end bottlenecks: 2 uops for port 1 (tzcnt and popcnt) on Intel CPUs. (shrx runs on port 0 or port 6, as a single uop. Enabling AVX2, which apparently enables BMI2 for MSVC, is important; otherwise it will use a 3-uop shr rax, cl.)

Critical path latency:

  • from yyy to result: 1 for SHRX, 3 for popcnt = 4 cycles
  • from xxx to result: 3 for TZCNT plus the above = 7 cycles

Unfortunately GCC is over-cautious about breaking false dependencies, costing extra front-end bandwidth. (But no extra back-end cost)

# GCC10.1
        xor     eax, eax          # could have just done tzcnt rdi,rdi
        tzcnt   rax, rdi
        shrx    rsi, rsi, rax
        xor     eax, eax          # pointless: RAX was already part of the dep chain leading to this.
        popcnt  rax, rsi          # GCC7.5 shifts into RAX for popcnt rax,rax to avoid this dep-breaking xor.
        ret
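
Following up on the FIXME above: a left-shift version that keeps the bits below position (matching the original bzhi masking) would look something like this. Just a sketch; it relies on position never being zero, as stated in the question, so the shift count never reaches 64:

uint64_t popcnt_shift_left(uint64_t xxx, uint64_t yyy) {
    auto position = _tzcnt_u64(xxx);
    auto shifted = yyy << (64 - position);   // discard bits at and above position
    return _mm_popcnt_u64(shifted);          // count only the bits below position
}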

Lower latency alternatives without tzcnt

(But more uops, potentially worse front-end throughput. Back-end execution port pressure benefits depend on surrounding code.)

BMI1 has some bithack instructions to do stuff like isolate the lowest set bit, all 1 uop with single-cycle latency on Intel. (AMD Zen runs them as 2 uops, 2 cycle latency: uops.info)

blsmsk - Get Mask Up to (and including) Lowest Set Bit. Your original is not inclusive of the LSB in xxx so unfortunately this mask isn't directly usable.

uint64_t zmask_blsmsk(uint64_t xxx, uint64_t yyy) {
    auto mask = _blsmsk_u64(xxx); 
    auto masked = yyy & ~(mask<<1);
    return masked;
}

;; MSVC -O2 -arch:AVX2  (to enable BMI for andn)
        blsmsk  rax, rcx
        add     rax, rax               ; left shift
        andn    rax, rax, rdx          ; (~stuff) & yyy
        ret     0

Or blsi will Isolate the Lowest Set Bit. That blsi(xxx) - 1 will create a mask up to and not including it. (For xxx=1, we'll get setbit = 1, so setbit - 1 = 0 and ~(setbit - 1) is all-ones, i.e. nothing gets cleared; your position is never zero, so that case doesn't arise anyway.)

uint64_t zmask2(uint64_t xxx, uint64_t yyy) {
    auto setbit = _blsi_u64(xxx); 
    auto masked = yyy & ~(setbit-1);  // yyy & -setbit
    return masked;
}

MSVC compiles as expected, same as clang:

        blsi    rax, rcx
        dec     rax
        andn    rax, rax, rdx
        ret     0

GCC uses the 2's complement identity ~(setbit - 1) == -setbit to transform it into this, using shorter instructions that can run on any port. (andn can only run on port 1 or port 5 on Haswell / Skylake.)

;; GCC7.5 -O3 -march=haswell.   Later GCC wastes a `mov` instruction
        blsi    rax, rdi
        neg     rax
        and     rax, rsi

This is 3 uops (not including popcnt) but only has 3 cycle latency from xxx -> result, down from 4 for tzcnt / shrx. (All of these are not counting the 3 cycle popcnt latency) And more importantly, it doesn't compete for port 1 with popcnt.

(The way MSVC compiles it, to blsi + dec + andn, is 2 uops for port 1 / port 5, though.)

The optimal choice will depend on the surrounding code, whether throughput or latency is the bottleneck.

If you're doing this for many different masks stored contiguously, SIMD could be effective. Avoiding tzcnt means you can do the lowest-set isolate or mask with bithacks that take a couple of instructions, e.g. blsi is (-SRC) bitwise-AND (SRC), as documented in the Operation section of Intel's asm manual. (A handy place to look up those bithack expressions.) blsmsk is (SRC-1) XOR (SRC).
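
For example, a sketch of those two bithacks with AVX2 intrinsics, working on four 64-bit elements at once (the function names here are my own, purely illustrative):

#include <immintrin.h>

// Per-element equivalent of blsi: (-SRC) AND SRC isolates the lowest set bit.
__m256i blsi_epi64(__m256i v) {
    __m256i neg = _mm256_sub_epi64(_mm256_setzero_si256(), v);  // 0 - v == -v
    return _mm256_and_si256(neg, v);
}

// Per-element equivalent of blsmsk: (SRC-1) XOR SRC gives a mask up to and
// including the lowest set bit.
__m256i blsmsk_epi64(__m256i v) {
    __m256i dec = _mm256_sub_epi64(v, _mm256_set1_epi64x(1));
    return _mm256_xor_si256(dec, v);
}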

SIMD popcnt can be done with vpshufb to do 4-bit parallel LUTs on the two halves of each byte, and you can use vpsadbw to accumulate the per-byte counts horizontally into a count for each 64-bit element. (To emulate Ice Lake's AVX512 vpopcntq.)
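
A sketch of that nibble-LUT popcount, using the same <immintrin.h> intrinsics as above (again the function name is mine, and this is only illustrative):

// Per-64-bit-element popcount: vpshufb as a 16-entry LUT on each nibble,
// then vpsadbw to horizontally sum the byte counts within each qword.
__m256i popcnt_epi64(__m256i v) {
    const __m256i lut = _mm256_setr_epi8(   // popcount of 0..15, repeated per 128-bit lane
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m256i low_nibbles = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_and_si256(v, low_nibbles);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_nibbles);
    __m256i per_byte = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                       _mm256_shuffle_epi8(lut, hi));
    return _mm256_sad_epu8(per_byte, _mm256_setzero_si256());
}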

Peter Cordes
  • "You're making this explicit with a static_cast to uint32_t..." - this is just to remove the warning about possible loss of data. – Marka May 25 '20 at 07:23
  • 1
    Blsmsk should do the trick. I can change code a little to include bit at position. So instead of 7 cycles, this should take 5C, at least on Intel. – Marka May 25 '20 at 07:27
  • @Marka: Ok perfect, `blsmsk` + `andn` + `popcnt` is probably ideal. Note that "take 5 cycles" only describes *latency* (from the `xxx` input being ready), which isn't always the bottleneck if there's any instruction-level parallelism. There's no single measure you can just add up across instructions to get a total cost. [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391). Latency is *not* a synonym for "performance" in this context. – Peter Cordes May 25 '20 at 07:40
  • I'm mostly interested in latency for this particular case. You have been most helpful. – Marka May 25 '20 at 07:56
1

It is a compiler thing (as of Visual C++ 2019 00435-60000-00000-AA388).
MSVC's immintrin.h defines

__int64 _bzhi_u64(unsigned __int64, unsigned int);

following Intel's suboptimal intrinsic definition, which contradicts the instruction's documentation (all bzhi operands are the same size).
clang has in bmi2intrin.h

unsigned long long _bzhi_u64(unsigned long long __X, unsigned long long __Y)

and so does not see the need to touch the _tzcnt_u64 result in your code.

I patched MSVC's immintrin.h - to no avail. Sad, because Peter's sophisticated workarounds do not apply to my case (lzcnt/bzhi, no popcnt).

pshufb
  • Narrow args are allowed to have high garbage. If MSVC internally thought that `bzhi` would only look at ECX not the full RCX, then it wouldn't need to zero-extend ECX to RCX. Perhaps the problem is that it thinks `tzcnt` only produces a 32-bit output that might not be zero-extended; if you were going to change anything, I'd look at the `_tzcnt_u64` definition. (Although as an intrinsic, I'd expect some stuff about it is hard-coded into the compiler, not parsed from the prototype.) – Peter Cordes Jul 18 '23 at 16:46