5

Consider the following code:

unsigned long long div(unsigned long long a, unsigned long long b, unsigned long long c) {
    unsigned __int128 d = (unsigned __int128)a*(unsigned __int128)b;
    return d/c;
}

When compiled with x86-64 gcc 10 or clang 10, both with -O3, it emits __udivti3, instead of DIVQ instruction:

div:
    mov     rax, rdi
    mov     r8, rdx
    sub     rsp, 8
    xor     ecx, ecx
    mul     rsi
    mov     r9, rax
    mov     rsi, rdx
    mov     rdx, r8
    mov     rdi, r9
    call    __udivti3
    add     rsp, 8
    ret

At least in my testing, the former is much slower than the (already) slow later, hence the question: is there a way to make a modern compiler emit DIVQ for the above code?

Edit: Let's assume the quotient fits into 64-bits register.

Ecir Hana
  • 10,864
  • 13
  • 67
  • 117

1 Answers1

4

div will fault if the quotient doesn't fit in 64 bits. Doing (a*b) / c with mul + a single div isn't safe in the general case (doesn't implement the abstract-machine semantics for every possible input), therefore a compiler can't generate asm that way for x86-64.

Even if you do give the compiler enough info to figure out that the division can't overflow (i.e. that high_half < divisor), unfortunately gcc/clang still won't ever optimize it to single a div with a non-zero high-half dividend (RDX).

You need an intrinsic or inline asm to explicitly do 128 / 64-bit => 64-bit division. e.g. Intrinsics for 128 multiplication and division has GNU C inline asm that looks right for low/high halves separately.

Unfortunately GNU C doesn't have an intrinsic for this. MSVC does, though: Unsigned 128-bit division on 64-bit machine has links.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Yes, I'm interested in the case when it cannot overflow, I even tried `unsigned __int128 d = (unsigned __int128)(a%c)*(unsigned __int128)(b%c)` which is even worse. What do you mean by `with a non-zero RDX`? That if the division is exact (i.e. no remainder), it will emit `DIVQ`? – Ecir Hana Jun 08 '20 at 08:47
  • @EcirHana: I mean a non-zero high-half *input* to division. [`div`](https://www.felixcloutier.com/x86/div) takes its dividend input in RDX:RAX. – Peter Cordes Jun 08 '20 at 08:51
  • Ah, ok. `DX` also contains the remainder after the division, so I was not sure. – Ecir Hana Jun 08 '20 at 08:56
  • 1
    @EcirHana: strictly, `DIVQ xx` will overflow if `RDX >= xx`, because the quotient is then 2^64 or more. But for all divisors (other than `0`) `RDX == 0` is OK. I tried adding `if ((d >> 64) >= c) return ULLONG_MAX ;` and `if ((d >> 64) != 0) return ULLONG_MAX ;` in front of the `return d/c ;`: the compiler was clever enough to test `RDX` in both cases, but still no `DIVQ` (with gcc 10.1 and clang 10.0.0) :-(. – Chris Hall Jun 08 '20 at 17:02