I have noticed that the unsafe intrinsics stabilised in Rust 1.55.0, `_bittestandset` and `_bittestandset64`, both appear to be slower than the same functionality implemented with bit-shift and bitwise AND/OR operators. Why might that be?
Consider the following snippet, which uses `_bittestandset64` to count all the zero bits of an `i64` while simultaneously setting every bit to 1 (the intrinsic returns the previous value of the selected bit and then sets it):
```rust
for bit in 0..i64::BITS {
    unsafe {
        if 0 == _bittestandset64(&mut x as *mut i64, bit as i64) {
            n += 1;
        }
    }
}
```
The same could be achieved without the intrinsic:
```rust
for bit in 0..i64::BITS {
    let b: i64 = 1i64 << (bit as i64);
    if 0 == (b & x) {
        x |= b;
        n += 1;
    }
}
```
My experimentation shows that not using the intrinsic is at least twice as fast. It also does not require `unsafe`.
I used `criterion` (0.4.0) to benchmark the code; here is my benchmark implementation:
```rust
use core::arch::x86_64::_bittestandset64;

/// Counts the zero bits of `x` with the intrinsic, setting each bit as it goes.
#[inline(never)]
fn set_and_count_intrinsic(mut x: i64) -> (i64, u32) {
    let mut n: u32 = 0;
    for bit in 0..i64::BITS {
        unsafe {
            if 0 == _bittestandset64(&mut x as *mut i64, bit as i64) {
                n += 1;
            }
        }
    }
    (x, n)
}

/// The same operation using only shifts and bitwise AND/OR.
#[inline(never)]
fn set_and_count_operators(mut x: i64) -> (i64, u32) {
    let mut n: u32 = 0;
    for bit in 0..i64::BITS {
        let b: i64 = 1i64 << (bit as i64);
        if 0 == (b & x) {
            x |= b;
            n += 1;
        }
    }
    (x, n)
}

fn bittestandset64(criterion: &mut criterion::Criterion) {
    const PRIME: i64 = 7823;
    let mut group = criterion.benchmark_group("bittestandset64");
    group.bench_with_input("_bittestandset64", &PRIME, |bench, x| {
        bench.iter(|| {
            let r = set_and_count_intrinsic(*x);
            criterion::black_box(r);
        })
    });
    group.bench_with_input("operators", &PRIME, |bench, x| {
        bench.iter(|| {
            let r = set_and_count_operators(*x);
            criterion::black_box(r);
        })
    });
    group.finish();
}
```
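To make this runnable as a bench target, I register it with the usual `criterion` 0.4 boilerplate (omitted above):

```rust
// Standard Criterion harness: builds a `main` that runs the group above.
criterion::criterion_group!(benches, bittestandset64);
criterion::criterion_main!(benches);
```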
- `criterion::black_box(…)` prevents the tool-chain from optimising away either `x` or `n`.
- `criterion::black_box(…)` sits inside the measured closure passed to `bench.iter(…)`, ensuring the result cannot be folded to a `const` at compile time.
- The benchmark for the 32-bit version is identical, with only the obvious alterations (see the sketch below).
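For reference, the intrinsic half of the 32-bit variant looks roughly like this (my paraphrase of those "obvious alterations", using `_bittestandset` and `i32`; the function name is mine):

```rust
use core::arch::x86_64::_bittestandset;

/// 32-bit counterpart of `set_and_count_intrinsic` (illustrative sketch).
#[inline(never)]
fn set_and_count_intrinsic_32(mut x: i32) -> (i32, u32) {
    let mut n: u32 = 0;
    for bit in 0..i32::BITS {
        unsafe {
            if 0 == _bittestandset(&mut x as *mut i32, bit as i32) {
                n += 1;
            }
        }
    }
    (x, n)
}
```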
The results speak for themselves, showing bit-twiddling with operators to be at least twice as fast as the intrinsics:
| implementation | t lower | time | t upper |
|---|---|---|---|
| `core::arch::…::_bittestandset` | 53.080 ns | 53.171 ns | 53.305 ns |
| bit-twiddling on `i32` | 15.074 ns | 15.172 ns | 15.305 ns |
| `core::arch::…::_bittestandset64` | 114.99 ns | 115.13 ns | 115.37 ns |
| bit-twiddling on `i64` | 57.607 ns | 57.652 ns | 57.709 ns |
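To dig into this, one could compare what the compiler emits for a single test-and-set in isolation; here is a minimal pair suitable for `cargo asm` or a release-mode disassembly (my own reduction, not part of the benchmark above):

```rust
use core::arch::x86_64::_bittestandset64;

/// One test-and-set via the intrinsic; note that it works through a raw
/// pointer, so the value is addressed through memory.
pub fn one_bit_intrinsic(x: &mut i64, bit: i64) -> bool {
    unsafe { 0 == _bittestandset64(x as *mut i64, bit) }
}

/// The equivalent with shift/AND/OR; the value can stay in a register.
pub fn one_bit_operators(x: &mut i64, bit: u32) -> bool {
    let b = 1i64 << bit;
    let was_zero = 0 == (*x & b);
    *x |= b;
    was_zero
}
```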
My test platform / environment:
- `rustc` version 1.64.0 (stable)
- Windows 10 x64, tested natively and under a WSL 2 environment
- Intel Core i7 processor (64-bit)
Why would there be this discrepancy? And if it is indeed expected, why do the intrinsics exist at all? In particular, why are they exposed and stable in the library?