From research and various attempts, I'm currently pretty sure that the only solution to this problem is to use assembly. I'm posting this question to document an existing problem, and maybe get attention from compiler developers, or turn up in searches about similar problems.

If anything changes in the future, I will accept it as an answer.

This is a closely related question for MSVC.


On x86_64 machines, div/idiv with a 32-bit operand is faster than with a 64-bit operand. When the dividend is 64-bit, the divisor is 32-bit, and you know that the quotient will fit in 32 bits, you don't have to use the 64-bit div/idiv. You can split the 64-bit dividend into two 32-bit halves (edx:eax), and even with this overhead, a 32-bit div is faster than a 64-bit div on the full 64-bit register.

The compiler will produce a 64-bit div for this function, and that is correct, because with a 32-bit div, if the quotient does not fit in 32 bits, a hardware exception (#DE) occurs.

uint32_t div_c(uint64_t a, uint32_t b) {
    return a / b;
}
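
As a side note, the exact condition for the quotient to fit is easy to state. Here is a small sketch of my own (not from the question): for unsigned division, the quotient fits in 32 bits exactly when the upper half of the dividend is smaller than the divisor, which is a check a caller could make before taking a 32-bit-div fast path.

static int quotient_fits_u32(uint64_t a, uint32_t b) {
    /* a / b < 2^32  <=>  a < b * 2^32  <=>  (a >> 32) < b */
    return (uint32_t)(a >> 32) < b;
}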

However, if the quotient is known to fit in 32 bits, doing a full 64-bit division is unnecessary. I used __builtin_unreachable to tell the compiler about this, but it doesn't make a difference.

uint32_t div_c_ur(uint64_t a, uint32_t b) {
    uint64_t q = a / b;
    if (q >= 1ull << 32) __builtin_unreachable();
    return q;
}

For both div_c and div_c_ur, the output from gcc is,

mov     rax, rdi
mov     esi, esi
xor     edx, edx
div     rsi
ret

clang does an interesting optimization: it checks whether the upper 32 bits of the dividend are zero and uses a 32-bit div in that case, but it still falls back to a 64-bit div whenever the dividend actually needs 64 bits.

        mov     rax, rdi
        mov     ecx, esi
        mov     rdx, rdi
        shr     rdx, 32
        je      .LBB0_1
        xor     edx, edx
        div     rcx
        ret
.LBB0_1:
        xor     edx, edx
        div     ecx
        ret

I had to write it directly in assembly to achieve what I want; I couldn't find any other way to do this.

__attribute__((naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
    "mov eax, edi\n\t"
    "mov rdx, rdi\n\t"
    "shr rdx, 32\n\t"
    "div esi\n\t"
    "ret\n\t"
);}

Was it worth it? perf reports 49.47% overhead for div_c versus 24.88% for div_asm, so on my computer (Tiger Lake), div r32 is about twice as fast as div r64.

This is the benchmark code. Note that the Intel-syntax inline asm needs to be built with gcc's -masm=intel (or the asm rewritten in AT&T syntax).

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

__attribute__((noinline))
uint32_t div_c(uint64_t a, uint32_t b) {
    uint64_t q = a / b;
    if (q >= 1ull << 32) __builtin_unreachable();
    return q;
}

__attribute__((noinline, naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
    "mov eax, edi\n\t"
    "mov rdx, rdi\n\t"
    "shr rdx, 32\n\t"
    "div esi\n\t"
    "ret\n\t"
);}

static uint64_t rdtscp() {
    uint32_t _;
    return __builtin_ia32_rdtscp(&_);
}

int main() {
    #define n 500000000ll
    uint64_t c;

    c = rdtscp();
    for (int i = 1; i <= n; ++i) {
        volatile uint32_t _ = div_c(i + n * n, i + n);
    }
    printf("  c%15ul\n", rdtscp() - c);

    c = rdtscp();
    for (int i = 1; i <= n; ++i) {
        volatile uint32_t _ = div_asm(i + n * n, i + n);
    }
    printf("asm%15ul\n", rdtscp() - c);
}
xiver77
  • It's a fair question. Off the top of my head, I might guess that this optimization is just not implemented. There are probably not many instances in real-life code where the compiler would be able to prove no overflow, especially since cases with a constant divisor are already likely to be optimized into magic-number multiplication, which is better anyway. – Nate Eldredge Jan 30 '22 at 17:59
  • @NateEldredge MSVC [provides an intrinsic](https://learn.microsoft.com/en-us/cpp/intrinsics/udiv64?view=msvc-170) `_udiv64` for this operation (a minimal usage sketch appears right after these comments). I agree that implementing this optimization on plain C code isn't worth much, but providing an intrinsic to do this would be a useful tool, as they currently do with some operations that the C language doesn't support natively (`popcnt`, `lzcnt`, `rol`/`ror`, etc.) – xiver77 Jan 30 '22 at 18:24
  • I agree, it'd be nice if gcc/clang provided an intrinsic, since [they seem to have one for just about every other x86 instruction in existence](https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions). But extended asm is powerful enough that you can basically create your own, that should optimize just about as well as if it were an intrinsic: https://godbolt.org/z/qYnE6e5TE . (clang seems to fall victim to https://github.com/llvm/llvm-project/issues/20571 though, doing unnecessary spills to use a memory operand where a register would be permitted.) – Nate Eldredge Jan 30 '22 at 19:00
  • Throw in a `__builtin_constant_p` branch to let it optimize to multiplication when the divisor is constant, and fold when both inputs are constant: https://godbolt.org/z/P37so695W . – Nate Eldredge Jan 30 '22 at 19:05
  • @NateEldredge I reformatted your comments to an answer. Thanks for letting me know a very useful feature of `gcc`. – xiver77 Jan 30 '22 at 23:51
  • Some related reports: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79219 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64308 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58897 – Marc Glisse Jan 31 '22 at 00:57
  • Note that the benefit of this optimization is *much* larger on Skylake and earlier Intel, where `div r64` is microcoded as 36 uops, vs. only 10 for `div r32`, and with much worse throughput and latency. See [Fast hardware integer division](https://stackoverflow.com/q/70132913) / [Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux](https://stackoverflow.com/a/52558274). But Ice Lake changed that: `div r64` is still lower throughput, but the same number of uops. (AMD only cares about the actual size of the numbers; the worst case can be worse for r64, but the same inputs run the same.) – Peter Cordes Jan 31 '22 at 02:48
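
As mentioned in the comments above, MSVC exposes this operation directly as an intrinsic. A minimal usage sketch of my own, based on the documentation linked above (not code from the question), assuming MSVC and its <intrin.h>:

#include <stdint.h>
#include <intrin.h>  /* MSVC-only: declares _udiv64 */

/* _udiv64 performs one 64-by-32 division; the caller must guarantee that the
 * quotient fits in 32 bits, otherwise the division faults (#DE). */
static uint32_t div_msvc(uint64_t a, uint32_t b) {
    uint32_t rem;
    return _udiv64(a, b, &rem);
}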

1 Answer


Every idea in this answer is based on comments by Nate Eldredge, from which I learned how powerful gcc's extended inline assembly is. Even though I still have to write assembly, it is possible to create a custom function that behaves as if it were an intrinsic.

static inline uint32_t divqd(uint64_t a, uint32_t b) {
    if (__builtin_constant_p(b)) {
        return a / b;
    }
    uint32_t lo = a;
    uint32_t hi = a >> 32;
    __asm__("div %2" : "+a" (lo), "+d" (hi) : "rm" (b));    
    return lo;
}

__builtin_constant_p returns 1 if b is known at compile time. The +a and +d constraints mean the values are read from and written back to the a and d registers (eax and edx). rm specifies that the input b can be either a register or a memory operand.

To check that inlining and constant propagation work as expected,

uint32_t divqd_r(uint64_t a, uint32_t b) {
    return divqd(a, b);
}

divqd_r:
        mov     rdx, rdi
        mov     rax, rdi
        shr     rdx, 32
        div     esi
        ret

uint32_t divqd_m(uint64_t a) {
    extern uint32_t b;
    return divqd(a, b);
}

divqd_m:
        mov     rdx, rdi
        mov     rax, rdi
        shr     rdx, 32
        div     DWORD PTR b[rip]
        ret

uint32_t divqd_c(uint64_t a) {
    return divqd(a, 12345);
}

divqd_c:
        movabs  rdx, 6120523590596543007
        mov     rax, rdi
        mul     rdx
        shr     rdx, 12
        mov     eax, edx
        ret

and the results are satisfying (https://godbolt.org/z/47PE4ovMM).
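
Since the asm statement already leaves the remainder in edx, the same pattern extends to a remainder helper essentially for free. A minimal sketch of my own (a straightforward variant of divqd above, not part of the original answer); as with divqd, the caller must guarantee the quotient fits in 32 bits:

static inline uint32_t remqd(uint64_t a, uint32_t b) {
    if (__builtin_constant_p(b)) {
        return a % b;  /* let the compiler handle constant divisors */
    }
    uint32_t lo = a;
    uint32_t hi = a >> 32;
    /* div computes edx:eax / b, quotient in eax, remainder in edx */
    __asm__("div %2" : "+a" (lo), "+d" (hi) : "rm" (b));
    return hi;
}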

xiver77
  • Nice. Using `"r,m"` multi-part alternative constraints can help avoid clang's missed-optimization where it always picks memory for `"rm"` constraints. (While still allowing GCC to pick either according to where the operand is). https://godbolt.org/z/fccc396ar clang does end up choosing a register even when that takes a separate `mov` load, but that's fine because clang doesn't support Intel-syntax inline asm so it wouldn't get an operand-size from the memory operand. (Using `divl` could fix that.) Anyway, this doesn't hurt GCC, and is better for clang. (Other operands become `"+a,a"` etc) – Peter Cordes Jan 31 '22 at 02:59
  • (Aside from syntax problems, if you have to choose always-memory or always-register for clang, it makes most sense to choose always-register. Sometimes you get an extra load that costs you an extra scratch register instead of folding the load into an extra uop for the `div`, but you avoid ever spilling something from a register just for `div` to reload it.) – Peter Cordes Jan 31 '22 at 03:02