As title says, optimizing compilers transform division by constant (non-pow 2) into reciprocal multiplication to avoid time expensive div
instruction. I'm curious whether there is a CPU that does this optimization as code is being executed, caching frequently used divisor's 'magical constants' in a cache table and use them for computing div
, if divisor operand is present in cache.
I've run some experiment on x64 (C++) and M1 (Rust, below), repeatedly dividing a number by same divisor and dividing by ever-changing divisor in loop. Both ended up being identical in performance comparison, I guess answer is no at least for these CPUs I tested on.
type div32 = i32;
#[inline(never)]
fn div_bench_fn(dividend: div32, mut divisor: div32, mut step: div32) -> f64 {
step = std::hint::black_box(step);
const LOOP_COUNT: u32 = 100_000_000;
let start = std::time::Instant::now();
let mut tmp = 0;
for _ in 0..LOOP_COUNT {
let res = dividend % divisor;
divisor += step;
tmp ^= res;
std::hint::black_box(tmp);
}
let elapsed = start.elapsed().as_secs_f64();
let div_sec = (LOOP_COUNT as f64) / elapsed;
div_sec / 1_000_000.
}
fn main() {
for _ in 0..10 {
println!("--");
println!(" static {:.2}m div/sec", div_bench_fn(0x12345678, 127, 0));
println!(" dynamic {:.2}m div/sec", div_bench_fn(0x12345678, 127, 1));
}
}
Looping code (ARMv8.5-A)
divbench[0x1000017e8] <+64>: cbz w9, 0x10000185c ; <+180> // panic if divisor == 0
divbench[0x1000017ec] <+68>: sdiv w13, w11, w9
divbench[0x1000017f0] <+72>: msub w13, w13, w9, w11
divbench[0x1000017f4] <+76>: add w9, w9, w19
divbench[0x1000017f8] <+80>: eor w8, w13, w8
divbench[0x1000017fc] <+84>: str w8, [sp, #0xc]
divbench[0x100001800] <+88>: subs w10, w10, #0x1
divbench[0x100001804] <+92>: b.ne 0x1000017e8 ; <+64>
Results (higher is better in this one)
--
static 1018.66m div/sec
dynamic 1568.52m div/sec
--
static 1574.76m div/sec
dynamic 1574.63m div/sec
--
static 1575.77m div/sec
dynamic 1574.38m div/sec
--
static 1577.74m div/sec
dynamic 1581.09m div/sec
--
static 1585.35m div/sec
dynamic 1591.46m div/sec
--
static 1572.47m div/sec
dynamic 1585.13m div/sec
--
static 1560.05m div/sec
dynamic 1571.18m div/sec
--
static 1575.05m div/sec
dynamic 1550.98m div/sec
--
static 1556.27m div/sec
dynamic 1571.85m div/sec
--
static 1581.93m div/sec
dynamic 1579.43m div/sec
Probably I'm not the only one to think this but I haven't been able to find information on this.