
I have noticed that the unsafe intrinsics stabilised in Rust 1.55.0

… both appear to be slower than the same functionality implemented with bit-shifts and bitwise AND and OR operators. Why might that be?

Consider the following snippet, which uses `_bittestandset64` to count all the zero bits of an `i64` while simultaneously setting all of its bits to 1 (the intrinsic returns the previous value of the selected bit and then sets it):

        for bit in 0..i64::BITS {
            unsafe {
                if 0 == _bittestandset64(&mut x as *mut i64, bit as i64) {
                    n += 1;
                }
            }
        }

The same could be achieved without the intrinsic:

        for bit in 0..i64::BITS {
            let b: i64 = 1i64 << (bit as i64);
            if 0 == (b & x) {
                x |= b;
                n += 1;
            }
        }

My experimentation shows that not using the intrinsic is at least twice as fast. It also does not require unsafe.
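Both implementations should finish with `x == -1` (every bit set) and with `n` equal to the number of zero bits in the input. A minimal standalone check of the operator version, separate from the benchmark:

    fn main() {
        let before: i64 = 7823;
        let mut x = before;
        let mut n: u32 = 0;
        for bit in 0..i64::BITS {
            let b: i64 = 1i64 << (bit as i64);
            if 0 == (b & x) {
                x |= b;
                n += 1;
            }
        }
        assert_eq!(x, -1);                   // every bit is now set
        assert_eq!(n, before.count_zeros()); // n counted the bits that were zero
    }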

I used criterion (0.4.0) to benchmark the code and here is my benchmark implementation:

use core::arch::x86_64::_bittestandset64;

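// Sets each bit of x in turn via the intrinsic, counting the bits that were previously clear.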
#[inline(never)]
fn set_and_count_intrinsic(mut x: i64) -> (i64, u32) {
    let mut n: u32 = 0;
    for bit in 0..i64::BITS {
        unsafe {
            if 0 == _bittestandset64(&mut x as *mut i64, bit as i64) {
                n += 1;
            }
        }
    }
    (x, n)
}
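
// The same semantics, implemented with a shift, a bitwise AND and a bitwise OR.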
#[inline(never)]
fn set_and_count_operators(mut x: i64) -> (i64, u32) {
    let mut n: u32 = 0;
    for bit in 0..i64::BITS {
        let b: i64 = 1i64 << (bit as i64);
        if 0 == (b & x) {
            x |= b;
            n += 1;
        }
    }
    (x, n)
}

fn bittestandset64(criterion: &mut criterion::Criterion) {
    const PRIME: i64 = 7823;

    let mut group = criterion.benchmark_group("bittestandset64");
    group.bench_with_input("_bittestandset64", &PRIME, |bench, x| bench.iter(|| {
        let r = set_and_count_intrinsic(*x);
        criterion::black_box(r);
    }));
    group.bench_with_input("operators", &PRIME, |bench, x| bench.iter(|| {
        let r = set_and_count_operators(*x);
        criterion::black_box(r);
    }));
    group.finish();
}
  • criterion::black_box(…) prevents the toolchain from optimising away either x or n.
  • criterion::black_box(…) is applied to each iteration's result, inside the bench.iter(…) closure, to ensure that the loop cannot be evaluated as a const at compile-time.
  • The benchmark for the 32-bit version is identical, with only the obvious alterations.
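
To actually run the group, it still needs to be registered with criterion's harness. With criterion 0.4 that is the usual pair of macros (the full, unredacted benchmark is in the gist linked in the comments below):

    use criterion::{criterion_group, criterion_main};

    // Register the benchmark group and generate the harness's main().
    criterion_group!(benches, bittestandset64);
    criterion_main!(benches);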

The results speak for themselves, showing bit-twiddling with operators to be at least twice as fast as the intrinsics:

| implementation | lower bound | estimate | upper bound |
| --- | --- | --- | --- |
| `core::arch::…::_bittestandset` | 53.080 ns | 53.171 ns | 53.305 ns |
| bit-twiddling on `i32` | 15.074 ns | 15.172 ns | 15.305 ns |
| `core::arch::…::_bittestandset64` | 114.99 ns | 115.13 ns | 115.37 ns |
| bit-twiddling on `i64` | 57.607 ns | 57.652 ns | 57.709 ns |

My test platform / environment:

  • rustc version 1.64.0 (stable)
  • Windows 10 x64
    • tested natively and under a WSL 2 environment
  • Intel Core i7 processor (64-bit)

Why would there be this discrepancy? And, if it is indeed expected, why do the intrinsics exist at all? In particular, why are they exposed and stable in the library?

Xharlie
  • Since `X` is the only input and a known constant, are you sure the compiler isn't pre-calculating the result for the manual bit-twiddling case? 450ps is suspiciously fast. – kmdreko Oct 05 '22 at 18:38
  • Indeed, there was a bug in the benchmark. The compiler was calculating the entire `for …` loop as `const`, at compile-time. I have employed `criterion::black_box()` more cunningly to prevent that and updated the table of times in the question. They now match my observations from my live code more accurately and also seem more sensible since the 64-bit results are about equal to double the 32-bit ones for both implementations. — @kmdreko – Xharlie Oct 05 '22 at 19:21
  • (Code snippet for the benchmark has also been updated.) – Xharlie Oct 05 '22 at 19:22
  • I still haven't checked the generated output, but you didn't rectify the issue that `X` is the only input and a known constant. The differences could still be due to const-propagation. Check out [Benchmarking With Inputs](https://bheisler.github.io/criterion.rs/book/user_guide/benchmarking_with_inputs.html) for how to do that properly with criterion. – kmdreko Oct 05 '22 at 19:52
  • I was already in the process of updating the code to use `bench_with_input()` – that's done, now, and the results and conclusion stand. `const`-propagation could not have affected the code after moving the calls to `black_box()` into the loop unless the compiler unrolled the whole loop, surely. – Xharlie Oct 05 '22 at 20:00
  • Compilers can absolutely unroll the whole loop :) Anyways, thanks for clearing up those issues. Benchmarking is hard to get right. – kmdreko Oct 05 '22 at 20:08
  • Just to do due diligence – and because anonymous calls in disassembly / LLVM-IR are not helpful – I've updated the snippet one final time to break out the implementations into separate, non-inlineable functions with no `const` in sight. (The results stand.) – Xharlie Oct 05 '22 at 20:25
  • I've posted my benchmark, unredacted, to: https://gist.github.com/d82954a111f6bea9054513e749619401 – Xharlie Oct 05 '22 at 20:43
  • By the way, if in your actual code you wanted to count the zeros, there's [a function](https://doc.rust-lang.org/std/primitive.i64.html#method.count_zeros) for that. And setting all non-one bits to one just means you always end up with `-1`. Wanted to bring it up just in case. – kmdreko Oct 05 '22 at 20:53
  • The intrinsic might be forcing the compiler to use the memory-destination version of the instruction, not keeping the value in a register. That's much slower, slower even than fully emulating its crazy-CISC bit-string behaviour that can index outside the dword/qword selected by the addressing mode. [How can memory destination BTS be significantly slower than load / BTS reg,reg / store?](https://stackoverflow.com/q/63406150) – Peter Cordes Oct 06 '22 at 01:54
  • On Intel CPUs, `bts reg,reg` is a single uop so it's fast; even on AMD it's only 2 uops for the reg,reg version. (And with `adc reg, 0` after, it might be a decent way to micro-optimize a loop if you insisted on counting bits the brute-force slow way. Instead of `64-popcnt(x)` / `x=-1` as @kmdreko says) – Peter Cordes Oct 06 '22 at 01:55
  • For those without all the pieces installed, could you post the assembly generated by your benchmark program (`--emit asm`)? – Nate Eldredge Oct 06 '22 at 02:57
  • Assembly emitted and added to my gist: https://gist.github.com/d82954a111f6bea9054513e749619401#file-bittestandset-1edfe46f36d51750-s You're looking for `set_and_count_intrinsic_64` and `set_and_count_operators_64`. The 32-bit versions do appear to be unrolled (although that only makes the operators version of that even more optimal) so the 64-bit versions are more sensible to compare. This assembly was emitted for the `x86_64-unknown-linux-gnu` target under the WSL 2 but perf. between WSL 2 and native Windows is consistent. – Xharlie Oct 06 '22 at 06:16
  • @kmdreko — see `assert_eq!((!0, PRIME.count_zeros()), r);` in the gist, which is what I was using to sanity-check my benchmark implementation. The real application that prompted this investigation was using a `Box<[u64]>` as a compact visit-map in a 4-neighbourhood flood-fill algorithm, to track whether indices had been visited. `_bittestandset64()` seemed like the logical way to use single bits to implement a visit-only-once mechanism but, when I implemented the `arch`-specific enhancement, I noticed it was slower – even in the real scenario – and set out to find out why. (A sketch of that visit-map pattern follows these comments.) – Xharlie Oct 06 '22 at 06:26
  • Further to last: as seen in the assembly output, the 32-bit case gets unrolled by the compiler, which gives the `set_and_count_operators_32()` variant a huge advantage over `set_and_count_intrinsic_32()` and over all 64-bit variants. In the real case of the flood-fill's visit-map, the compiler would *not* be able to unroll the loop, so the 64-bit benchmark is more relevant; it is not unrolled in the assembly I'm seeing from `rustc`. – Xharlie Oct 06 '22 at 06:30
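
To illustrate that visit-map use case, here is a minimal sketch of the operator-based test-and-set over a `Box<[u64]>` (the `VisitMap` type and `visit` method are hypothetical names, not taken from the gist):

    /// A compact visit map: one bit per index, backed by a boxed slice of u64.
    struct VisitMap {
        bits: Box<[u64]>,
    }

    impl VisitMap {
        fn new(len: usize) -> Self {
            Self { bits: vec![0u64; (len + 63) / 64].into_boxed_slice() }
        }

        /// Marks `index` as visited and returns true if it was unvisited before,
        /// i.e. the operator-based equivalent of a test-and-set.
        fn visit(&mut self, index: usize) -> bool {
            let word = index / 64;
            let mask = 1u64 << (index % 64);
            let was_unvisited = self.bits[word] & mask == 0;
            self.bits[word] |= mask;
            was_unvisited
        }
    }

In the flood fill, `if map.visit(index) { … }` then guards the per-cell work so that each index is processed at most once.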

0 Answers