
I'm designing a prefixed variable-length integer encoding.

Rust has methods for counting leading and trailing ones and zeros: https://doc.rust-lang.org/std/primitive.u64.html#method.leading_zeros

Is there any difference in the efficiency of these methods on x86_64, arm32 and arm64?

e.g. if counting trailing ones is faster than counting trailing zeros, I will use xxxx0111 instead of xxxx1000 for the length-encoding byte (both encode three following bytes, in this example).
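
To make that concrete, here's a minimal sketch of the decode step I have in mind (the exact format is still undecided; the trailing-ones variant is just one option):

// Hypothetical prefix decode using trailing ones: 0b_xxxx_0111 -> 3 following bytes.
fn following_bytes_trailing_ones(prefix: u8) -> u32 {
    prefix.trailing_ones()
}

// The alternative encoding: 0b_xxxx_1000 -> 3 following bytes.
fn following_bytes_trailing_zeros(prefix: u8) -> u32 {
    prefix.trailing_zeros()
}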

fadedbee
  • `on x86_64, arm32 and arm64?` So which _instruction sets_ are supported? The _best_ and simplest way would be to just write a small program and inspect the generated assembly on these architectures. Please try it. Remember the [rules of optimization](https://wiki.c2.com/?RulesOfOptimizationClub). – KamilCuk Nov 21 '20 at 21:07

2 Answers


Counting trailing zeros is faster than counting trailing ones on all 3 ISAs: x86-64, ARM, AArch64. They all provide zero-counting instructions, like x86 bsf (find the lowest set bit) or x86 BMI1 tzcnt (Trailing Zero Count). Counting leading/trailing ones in a runtime-variable value would require negating the input first.
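
In Rust terms, counting trailing ones is counting trailing zeros of the complement; a sketch of the equivalence (not the standard library's literal implementation, but it compiles to the same NOT followed by a trailing-zero count):

// Counting trailing ones costs one extra NOT/MVN compared to trailing zeros.
fn trailing_ones_via_not(x: u64) -> u32 {
    (!x).trailing_zeros()
}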

ARM / AArch64 provide a leading-zero count, but the best option for a trailing-zero count is rbit / clz, i.e. bit-reverse then count leading zeros (available since ARMv6T2 or ARMv7): https://godbolt.org/z/Yr7eac. Before that, compilers have to isolate the lowest set bit with x & -x, count leading zeros of that, and take 31 - clz(x & -x).
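
That fallback looks like this in Rust (a sketch of the pattern compilers use on cores without rbit, not what rustc literally emits):

// Trailing-zero count built from a leading-zero count only:
// isolate the lowest set bit with x & -x, then take 31 - clz of that.
fn tz_via_clz_only(x: u32) -> u32 {
    if x == 0 {
        return 32; // match u32::trailing_zeros for a zero input
    }
    let lowest = x & x.wrapping_neg(); // x & -x isolates the lowest set bit
    31 - lowest.leading_zeros()
}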


(On x86, counting leading zeros is most efficient with lzcnt (normally available alongside BMI1). Without it, bsr can give you the position of the highest set bit, so the compiler needs 31-bsr(x) (or 63-bsr(x) for 64-bit) to implement clz. On AMD CPUs, bsf / bsr are significantly slower than their tzcnt / lzcnt counterparts, so if possible it's good to compile with -march=native or the Rust equivalent, -C target-cpu=native.)

On AMD Zen, lzcnt is 1 uop, tzcnt is 2 uops (https://uops.info/). Presumably the extra uop is either a bit-reverse or an isolate-lowest-set-bit (x & -x), but either way it's generating something to feed to the lzcnt hardware.

Related fun fact: looping over the set bits to find their positions is often best done with x = blsr(x), i.e. x &= x-1, as the loop-carried dependency, and independently bit-scanning each result for its set bit. That way there isn't a bit-scan and shift in the critical-path latency of the loop.
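
A sketch of that loop structure in Rust (the function name is illustrative):

// The loop-carried dependency is only x &= x - 1 (BMI1 blsr); each
// trailing_zeros is an independent bit-scan of that iteration's x,
// so it stays off the critical path.
fn for_each_set_bit(mut x: u64, mut f: impl FnMut(u32)) {
    while x != 0 {
        f(x.trailing_zeros()); // position of the current lowest set bit
        x &= x - 1;            // clear the lowest set bit
    }
}

For example, for_each_set_bit(0b1010, |i| println!("{i}")) visits bit positions 1 and 3.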

Related:

  • Set the leading zero bits in any size integer C++
  • bsr/bsf and tzcnt performance on historical x86 CPUs

Peter Cordes
  • Thanks, I hadn't come across godbolt before. – fadedbee Nov 22 '20 at 05:50
  • `rbit` is available from `armv7` – Jake 'Alquimista' LEE Nov 23 '20 at 09:39
  • @Jake'Alquimista'LEE: My Godbolt link shows GCC using `rbit` with `-march=armv6t2`. Oh, but apparently not with plain ARMv6. – Peter Cordes Nov 23 '20 at 09:57
  • Wow, interesting. A chess move generator relies on iterating through 1 bits in a given `u64`, and switching to iterating the `1` bits from LSB to MSB brought a measurable 3-5% performance increase (in Rust). – Marv Nov 01 '22 at 14:43
  • @Marv: What were you doing before, like `x <<= lzcnt(x)` as the loop-carried dependency? On x86, especially with [BMI1 for `blsr`](https://www.felixcloutier.com/x86/blsr), you can iterate `x = blsr(x)` (1 cycle latency on Intel, 2 on AMD https://uops.info/) to clear the lowest set bit, and *independently* `tzcnt(x)` on each x, so the loop-carried dependency chain is shorter. – Peter Cordes Nov 01 '22 at 21:30
  • @PeterCordes Before: `0x8000000000000000_u64.rotate_right(x.leading_zeros())` -- After: `1 << x.trailing_zeros()`. Just getting started with Rust here :-) Might check out the bitintr crate which contains some more low-level bit operations, though I'm not sure right now what the current code compiles to in terms of machine code. Possibly rustc is already being quite efficient. – Marv Nov 01 '22 at 21:58
  • @Marv: Rustc should be perfectly capable of compiling `x &= x-1` to `blsr` if you use `rustc -C target-cpu=native` or something that lets it use BMI1 instructions; you don't need an intrinsic for it. (Err, I guess you'd want `x.wrapping_sub(1)` because the default is that overflow isn't allowed.) – Peter Cordes Nov 01 '22 at 22:02
  • @Marv: `1 << x.trailing_zeros()` is just isolating the lowest set bit, isn't it? Should be even faster to do `x & -x`, either two cheap instructions (or maybe 3 including a `mov`) or x86 BMI1 `blsi`. But even with `mov`/`neg`/`and`, it's still cheaper than `bsf` / `mov reg,1` / `shl reg, cl`, especially on Intel where variable-count shifts are 3 uops without BMI2 `shlx`. Hopefully a smarter compiler would use `bsf` / `xor eax,eax` / `bts eax, reg` to compile `1 << x.trailing_zeros()`. – Peter Cordes Nov 01 '22 at 22:04

On amd64 / x86_64, fastest to slowest are:

  • trailing zeros
  • leading zeros / trailing ones
  • leading ones

On arm64 / aarch64, fastest to slowest are:

  • leading zeros
  • leading ones / trailing zeros (tied)
  • trailing ones

Test results from godbolt.org:

pub fn lz(num: u64) -> u32 {
    num.leading_zeros()
}

pub fn lo(num: u64) -> u32 {
    num.leading_ones()
}

pub fn tz(num: u64) -> u32 {
    num.trailing_zeros()
}

pub fn to(num: u64) -> u32 {
    num.trailing_ones()
}

amd64 / x86_64:

example::lz:
        test    rdi, rdi
        je      .LBB0_1
        bsr     rax, rdi
        xor     rax, 63
        ret
.LBB0_1:
        mov     eax, 64
        ret

example::lo:
        not     rdi
        test    rdi, rdi
        je      .LBB1_1
        bsr     rax, rdi
        xor     rax, 63
        ret
.LBB1_1:
        mov     eax, 64
        ret

example::tz:
        test    rdi, rdi
        je      .LBB2_1
        bsf     rax, rdi
        ret
.LBB2_1:
        mov     eax, 64
        ret

example::to:
        not     rdi
        test    rdi, rdi
        je      .LBB3_1
        bsf     rax, rdi
        ret
.LBB3_1:
        mov     eax, 64
        ret

arm64 / aarch64:

example::lz:
        clz     x0, x0
        ret

example::lo:
        mvn     x8, x0
        clz     x0, x8
        ret

example::tz:
        rbit    x8, x0
        clz     x0, x8
        ret

example::to:
        mvn     x8, x0
        rbit    x8, x8
        clz     x0, x8
        ret
fadedbee