I don't think there's anything better than the naive approach for the maximum. One attempt is using the identity
x + y = min(x, y) + max(x, y)
and thus
max(ctz(x), ctz(y)) = ctz(x) + ctz(y) - min(ctz(x), ctz(y))
This way, we can reduce the max function to the min function we already optimized, albeit with a few additional operations.
Here are some Rust implementations of the different approaches:
pub fn naive(x: u64, y: u64) -> u32 {
x.trailing_zeros().max(y.trailing_zeros())
}
pub fn sum_minus_min(x: u64, y: u64) -> u32 {
x.trailing_zeros() + y.trailing_zeros() - (x | y).trailing_zeros()
}
pub fn nielsen(x: u64, y: u64) -> u32 {
let x_lsb = x & x.wrapping_neg();
let y_lsb = y & y.wrapping_neg();
let xy_lsb = x_lsb | y_lsb;
let lsb = xy_lsb & xy_lsb.wrapping_neg();
let xy_max_lsb = if xy_lsb == lsb { lsb } else { xy_lsb ^ lsb };
xy_max_lsb.trailing_zeros()
}
pub fn timmermans(x: u64, y: u64) -> u32 {
let loxs = !x & x.wrapping_sub(1);
let loys = !y & y.wrapping_sub(1);
return (loxs | loys).count_ones();
}
pub fn kealey(x: u64, y: u64) -> u32 {
((x | x.wrapping_neg()) & (y | y.wrapping_neg())).trailing_zeros()
}
Results on my machine:
ctz_max/naive time: [279.09 ns 279.55 ns 280.10 ns]
ctz_max/sum_minus_min time: [738.91 ns 742.87 ns 748.61 ns]
ctz_max/nielsen time: [935.35 ns 937.63 ns 940.40 ns]
ctz_max/timmermans time: [803.39 ns 806.98 ns 810.76 ns]
ctz_max/kealey time: [295.03 ns 295.93 ns 297.03 ns]
The naive implementation beats all other implementations. The only implementation that can compete with the naive one is the approach suggested by Martin Kealey. Note that the actual factors between the implementation may be even higher than the timings indicate, due to some overhead of the test harness.
It's clear that you only have like a couple of CPU instructions to spare to optimize the naive implementation, so I don't think there is anything you can do. For reference, here is the assembly emitted by the Rust compiler when these implementations are compiled as standalone functions on a modern x86_64 processor:
example::naive:
tzcnt rcx, rdi
tzcnt rax, rsi
cmp ecx, eax
cmova eax, ecx
ret
example::sum_minus_min:
tzcnt rcx, rdi
tzcnt rax, rsi
add eax, ecx
or rsi, rdi
tzcnt rcx, rsi
sub eax, ecx
ret
example::nielsen:
blsi rax, rdi
blsi rcx, rsi
or rcx, rax
blsi rax, rcx
xor edx, edx
cmp rcx, rax
cmovne rdx, rcx
xor rdx, rax
tzcnt rax, rdx
ret
example::timmermans:
lea rax, [rdi - 1]
andn rax, rdi, rax
lea rcx, [rsi - 1]
andn rcx, rsi, rcx
or rcx, rax
xor eax, eax
popcnt rax, rcx
ret
example::kealey:
mov rax, rdi
neg rax
or rax, rdi
mov rcx, rsi
neg rcx
or rcx, rsi
and rcx, rax
tzcnt rax, rcx
ret
In the benchmarks I ran, the functions get inlined, the loops partially unrolled and some subexpressions pulled out of the inner loops, so the assembly looks a lot less clean that the above.
For testing, I used Criterion. Here is the additional code:
use criterion::{black_box, criterion_group, criterion_main, Criterion};
const NUMBERS: [u64; 32] = [
...
];
fn bench<F>(func: F)
where
F: Fn(u64, u64) -> u32,
{
for x in NUMBERS {
for y in NUMBERS {
black_box(func(x, y));
}
}
}
fn compare(c: &mut Criterion) {
let mut group = c.benchmark_group("ctz_max");
group.bench_function("naive", |b| b.iter(|| bench(naive)));
group.bench_function("sum_minus_min", |b| b.iter(|| bench(sum_minus_min)));
group.bench_function("nielsen", |b| b.iter(|| bench(nielsen)));
group.bench_function("timmermans", |b| b.iter(|| bench(timmermans)));
group.bench_function("kealey", |b| b.iter(|| bench(kealey)));
}
criterion_group!(benches, compare);
criterion_main!(benches);
NUMBERS
was generated with this Python code, with the intention of making branch prediction for the min()
function as hard as possible:
[
random.randrange(2 ** 32) * 2 ** random.randrange(32)
for dummy in range(32)
]
I'm running the benchmark using
RUSTFLAGS='-C target-cpu=native -C opt-level=3' cargo bench
on an 8th generation i7 processor (Whiskey Lake).