128bit by 64bit native division

Question

I need to perform 128bit by 64bit divisions in Rust. The x86-64 ISA contains a native DIV instruction for this purpose. However, my compiled test code doesn't use this instruction.

Test code:

pub fn div(hi: u64, lo: u64, divisor: u64) -> u64 {
   assert!(hi < divisor);

   let dividend = ((hi as u128) << 64) + lo as u128;
   (dividend / divisor as u128) as u64
}

Compiler explorer output:

example::div:
    push    rax
    cmp     rdi, rdx
    jae     .LBB0_1
    mov     rax, rdi
    mov     rdi, rsi
    mov     rsi, rax
    xor     ecx, ecx
    call    qword ptr [rip + __udivti3@GOTPCREL]
    pop     rcx
    ret
.LBB0_1:
    ...

Instead, an inefficient 128bit by 128bit division is performed via __udivti3. This is probably because the DIV instruction causes a CPU exception if the quotient does not fit into 64 bits. In my case, however, this is impossible: hi < divisor, lo < 2^64 -> dividend = hi * 2^64 + lo <= (divisor - 1) * 2^64 + 2^64 - 1 = divisor * 2^64 - 1 -> dividend / divisor <= 2^64 - 1 / divisor < 2^64

How can I force the compiler to use the native instruction?

How many trillions of these operations are you doing that this performance hit matters? If you're in code that's extremely performance sensitive, you probably want to use SIMD anyway. — tadman, Sep 09 '22 at 22:08
As far as I know there's no way to force a narrowing div instruction except with an `unsafe` asm block. Related: https://stackoverflow.com/questions/62257103/emit-div-instruction-instead-of-udivti3 — stepan, Sep 09 '22 at 22:12
If this is needed as part of big int division (multiple divisions are performed with the same divisor), then I also suggest looking into division via multiplication. — stepan, Sep 09 '22 at 22:18
It's not as bad as you think. At least on my system, stepping through the `__udivti3` function, it performs a runtime test for precisely this case (`rcx == 0 && rdx > rsi`), and if so performs a single `div`. So all you pay is a couple extra ALU instructions and register shuffling, and some branches that will be well predicted if this is in a tight loop. Which I suspect are all small beans compared to the costly `div` itself. — Nate Eldredge, Sep 09 '22 at 22:39

score 6 · Accepted Answer · answered Sep 09 '22 at 23:25

Your only option is to use inline assembly. There might be an obscure combination of compiler flags that can coerce llvm into performing the optimization itself, but I don't think trying to find it is worth the effort. With assembly, it would look like this:

use std::arch::asm;
pub fn div(hi: u64, lo: u64, divisor: u64) -> u64 {

    assert!(hi < divisor);

    #[cfg(target_arch = "x86_64")]
    unsafe {
        let mut quot = lo;
        let mut _rem = hi;
        asm!(
            "div {divisor}",
            divisor = in(reg) divisor,
            inout("rax") quot,
            inout("rdx") _rem,
            options(pure, nomem, nostack)
        );
        quot
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        let dividend = ((hi as u128) << 64) + lo as u128;
        (dividend / divisor as u128) as u64
    }
}

Godbolt

On x86_64, this just compiles the division down to a little register shuffling followed by a div, and performs the call to __udivti3 on other systems. It also shouldn't get in the way of the optimizer too much since it's pure.

It's definitely worth actually benchmarking your application to see if this actually helps. It's a lot easier for llvm to reason about integer division than inline assembly, and missed optimizations elsewhere could easily result in this version running slower than using the default version.

128bit by 64bit native division

1 Answers1