
As the title says, optimizing compilers transform division by a constant (non-power-of-2) into multiplication by a precomputed reciprocal ("magic constant") to avoid the expensive div instruction. I'm curious whether any CPU does this optimization as the code executes: caching the magic constants of frequently used divisors in a small table, and using them to compute the div whenever the divisor operand hits that cache.
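
For concreteness, this is the transformation I mean, sketched in Rust. I'm assuming the "round-up" reciprocal scheme from Lemire & Kaser's fastmod (compilers and libdivide use close relatives of it); `magic` and `fast_div` are names I made up, and this particular variant needs divisor >= 2:

// Pay one real division up front to build c = ceil(2^64 / d); afterwards
// n / d is just the high 64 bits of the 128-bit product c * n, for every u32 n.
fn magic(d: u32) -> u64 {
    assert!(d >= 2); // d == 1 would overflow here; compilers special-case it anyway
    u64::MAX / d as u64 + 1 // == ceil(2^64 / d)
}

fn fast_div(n: u32, c: u64) -> u32 {
    (((c as u128) * (n as u128)) >> 64) as u32 // widening multiply, keep the high half
}

fn main() {
    let d = 127u32;
    let c = magic(d);
    for n in [0u32, 1, 126, 127, 128, 0x12345678, u32::MAX] {
        assert_eq!(fast_div(n, c), n / d);
    }
    println!("magic constant for /{d}: {c:#x}");
}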

I've run some experiments on x64 (C++) and on an M1 (Rust, below), repeatedly dividing a number by the same divisor versus dividing by an ever-changing divisor in a loop. The two cases performed identically (aside from a slower first run, presumably warm-up), so I guess the answer is no, at least for the CPUs I tested.

type div32 = i32;

#[inline(never)]
fn div_bench_fn(dividend: div32, mut divisor: div32, mut step: div32) -> f64 {
    // Make sure the optimizer can't treat `step` as a known constant and
    // strength-reduce the static-divisor case at compile time.
    step = std::hint::black_box(step);
    const LOOP_COUNT: u32 = 100_000_000;
    let start = std::time::Instant::now();
    let mut tmp = 0;
    for _ in 0..LOOP_COUNT {
        let res = dividend % divisor; // one integer remainder per iteration (sdiv + msub below)
        divisor += step; // step == 0: same divisor forever; step == 1: new divisor every iteration
        tmp ^= res;
        std::hint::black_box(tmp); // keep the result observable so the loop isn't optimized out
    }
    let elapsed = start.elapsed().as_secs_f64();
    let div_sec = (LOOP_COUNT as f64) / elapsed;
    div_sec / 1_000_000. // millions of divisions per second
}

fn main() {
    for _ in 0..10 {
        println!("--");
        println!("  static {:.2}m div/sec", div_bench_fn(0x12345678, 127, 0));
        println!("  dynamic {:.2}m div/sec", div_bench_fn(0x12345678, 127, 1));
    }
}

Loop body (ARMv8.5-A disassembly)

divbench[0x1000017e8] <+64>:  cbz    w9, 0x10000185c           ; <+180> // panic if divisor == 0
divbench[0x1000017ec] <+68>:  sdiv   w13, w11, w9              // quotient = dividend / divisor
divbench[0x1000017f0] <+72>:  msub   w13, w13, w9, w11         // res = dividend - quotient * divisor (the %)
divbench[0x1000017f4] <+76>:  add    w9, w9, w19               // divisor += step
divbench[0x1000017f8] <+80>:  eor    w8, w13, w8               // tmp ^= res
divbench[0x1000017fc] <+84>:  str    w8, [sp, #0xc]            // black_box forces the store
divbench[0x100001800] <+88>:  subs   w10, w10, #0x1            // decrement loop counter
divbench[0x100001804] <+92>:  b.ne   0x1000017e8               ; <+64> // loop

Results (higher is better)

--
  static 1018.66m div/sec
  dynamic 1568.52m div/sec
--
  static 1574.76m div/sec
  dynamic 1574.63m div/sec
--
  static 1575.77m div/sec
  dynamic 1574.38m div/sec
--
  static 1577.74m div/sec
  dynamic 1581.09m div/sec
--
  static 1585.35m div/sec
  dynamic 1591.46m div/sec
--
  static 1572.47m div/sec
  dynamic 1585.13m div/sec
--
  static 1560.05m div/sec
  dynamic 1571.18m div/sec
--
  static 1575.05m div/sec
  dynamic 1550.98m div/sec
--
  static 1556.27m div/sec
  dynamic 1571.85m div/sec
--
  static 1581.93m div/sec
  dynamic 1579.43m div/sec
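
(If the P-core is running near its ~3.2 GHz peak, ~1575M divisions/sec works out to roughly 3.2e9 / 1.575e9 ≈ 2 cycles per iteration, which looks like the loop is limited by sdiv throughput in both the static and dynamic cases.)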

I'm probably not the first to think of this, but I haven't been able to find any information on it.

Iwa
  • I haven't heard of any CPUs that do this, but it's actually a clever idea. At least, if there's time to generate the constants "in the background" for that divisor before the code is done using it. It's something that could get profiled with a simulator for some real-world workloads to see if they often happen to use the same divisor enough for it to stay hot in a reasonable-sized cache with pseudo-LRU replacement. (Perhaps some logic to try to index by code address, since some `div` instructions are in loops using the same divisor, and also speculatively probing the cache could reduce latency.) – Peter Cordes Dec 30 '22 at 10:59
  • That would be a lot of die space and instruction decode logic for something that can be done by the compiler, and for a less than common circumstance (division by a constant, common, but not that common). You can be sure that if the hardware could optimise it in a single instruction, the compiler would not do so with _multiple_ instructions, since the div would then be optimal already. It would be contrary to the design principle of RISC architectures I think. Better suited to https://electronics.stackexchange.com/ perhaps? – Clifford Dec 30 '22 at 11:00
  • @Clifford: You're missing the point: computing the constants takes longer than just dividing; it's only worth doing if you can actually cache them somewhere, not just use them for one instruction. So compilers can't do that, except maybe in the special case of a divide in a loop using the same divisor. (Then worth considering, although in C the norm is to leave that up to the programmer to use libdivide). Handling the general case of any divisor isn't that cheap in software, some need more shifts and fixups, although it's still better than the alternative, but hardware could do that better. – Peter Cordes Dec 30 '22 at 11:05
  • @Clifford: Also, die area is cheap these days as transistors get so small that power-density is the main problem; we *want* more [dark silicon](https://en.wikipedia.org/wiki/Dark_silicon) that only switches in rare cases, but speeds up those cases in a useful way. That's why we have more and more specialized instructions that need execution units which rarely get used. Making L1d and L2 cache bigger are ways to spend more transistors, but addressing larger caches can slow them down. A separate cache like a divisor cache could just sit there most of the time, only costing leakage power. – Peter Cordes Dec 30 '22 at 11:09
  • This idea could maybe work for the mantissa divisors in FP / SIMD division units, too; naive (or strict-FP) code that doesn't hoist an FP reciprocal out of a loop could maybe be sped up a lot. – Peter Cordes Dec 30 '22 at 11:11
  • @PeterCordes perhaps. I would assume a compiler only performs this optimisation for _constant_ divisors where the coefficients are calculated at compile time not by run-time code. Perhaps I misunderstood as you suggest, but presenting the test code in Rust is not helping (me). The point about die-space stands however. A compiler might reasonably detect a loop-invariant and do something, but I doubt the processor could. – Clifford Dec 30 '22 at 11:16
  • @PeterCordes regarding die-space, perhaps in the targets tested, but I work primarily with MCU where peripherals, flash and SRAM are all competing for space, and the economies of scale, cost-sensitivity, and yields do not justify advanced high-density processes. Die space is not always the cheap resource you suggest. – Clifford Dec 30 '22 at 11:22
  • @Clifford: Indeed, compilers do this for compile-time constant divisors ([Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935)), but not cases like dividing every array element by the same integer; you have to use libdivide manually if you want to speed that up. https://godbolt.org/z/hvh9q8T86 . – Peter Cordes Dec 30 '22 at 11:36
  • Re: die space, I'm of course talking about high-end CPUs like M1 and contemporary x86. As Mysticial mentions in [his Zen 4 microarchitecture review](https://www.mersenneforum.org/showthread.php?p=614191), single-uop `vpmullq` and `vpermt2b zmm` (byte-shuffle of 64 elements from 128) is one way for AMD to show off their process-size advantages and throw dark silicon at those instructions. Of course you won't throw die space at this in an MCU; it's probably only running one program, and devs can optimize the software if it's not fast enough for its one job. Unlike general-purpose CPUs. – Peter Cordes Dec 30 '22 at 11:37
  • To avoid some very amusing side-channel attacks, such a division cache feature will have to be disabled by default, and only enabled on request by an operating system which promises to flush the cache on every context switch. So if it existed (which as noted, it does not), it couldn't be a secret microarchitectural feature; it'd need an enable bit and a cache-flush instruction, and you'd find those in the architecture manual. – Nate Eldredge Jan 02 '23 at 07:19
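
The software-side hoisting discussed in the comments (what libdivide automates) would look something like this: pay for one real division to build the magic constant, then reuse it across the loop. A sketch under the same assumptions as the snippet in the question (divisor >= 2, made-up helper name), not anything the hardware does:

// Divide every element by the same runtime divisor without a hardware div in the loop.
fn div_all(xs: &mut [u32], d: u32) {
    assert!(d >= 2);
    let c = u64::MAX / d as u64 + 1; // hoisted: c == ceil(2^64 / d), one slow division
    for x in xs.iter_mut() {
        *x = (((c as u128) * (*x as u128)) >> 64) as u32; // multiply + shift per element
    }
}

fn main() {
    let mut v: Vec<u32> = (0..10u32).map(|i| i * 1_000 + 7).collect();
    let expected: Vec<u32> = v.iter().map(|&x| x / 127).collect();
    div_all(&mut v, 127);
    assert_eq!(v, expected);
}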

0 Answers