0

I have a few utilities that promote 128 bit operations like 64x64 bit multiplications that I have successfully translated to the equivalent __uint128_t arithmetic with carry. I checked and gcc generates exactly the same assembly instructions.

However I am failing to find the equivalent C/C++ code for the case of 128-bit division. The original asm statement is this:

#include <stdint.h>
void div64(uint64_t low, uint64_t high, uint64_t divisor,
                         uint64_t& quotient, uint64_t& remainder) {
    asm("divq %2" : "+a"(low), "+d"(high) : "rm"(divisor));
    quotient = low;
    remainder = high;
}

However when I try to replace it with this

void div64( uint64_t low, uint64_t high, uint64_t divisor,
            uint64_t& quotient, uint64_t& remainder) {
    __uint128_t val = __uint128_t(low) | (__uint128_t(high)<<64);
    quotient = val/divisor;
    remainder = val%divisor;
}    

The compiler fails to match these operations to a single divq instruction. Instead, it produces a much longer version with an additional library call to __udivti3, even in -O3 optimization mode.

Godbolt Link

div64b(unsigned long, unsigned long, unsigned long, unsigned long&, unsigned long&):                       # @div64b(unsigned long, unsigned long, unsigned long, unsigned long&, unsigned long&)
        pushq   %r15
        pushq   %r14
        pushq   %r12
        pushq   %rbx
        pushq   %rax
        movq    %r8, %r14
        movq    %rcx, %r15
        movq    %rdx, %r12
        movq    %rdi, %rbx
        xorl    %ecx, %ecx
        callq   __udivti3@PLT
        movq    %rax, (%r15)
        imulq   %r12, %rax
        subq    %rax, %rbx
        movq    %rbx, (%r14)
        addq    $8, %rsp
        popq    %rbx
        popq    %r12
        popq    %r14
        popq    %r15
        retq

Am I missing anything?

Note: I tagged this question with both C and C++ because the code above is valid on both languages.

Something Something
  • 3,999
  • 1
  • 6
  • 21
  • 1
    Does this answer your question? https://stackoverflow.com/questions/62257103/emit-div-instruction-instead-of-udivti3 – user253751 Aug 12 '22 at 00:16
  • Even if GCC could prove that the quotient will fit in a uint64_t (and thus not `#DE` -> SIGFPE), it misses the optimization. (i.e. that `high < divisor`). In your case it's not even possible since the pure C has to safely truncate the result, not raise an exception. e.g. try your inline asm with `0x10000:0000000000000000 / 1` or something. – Peter Cordes Aug 12 '22 at 00:17
  • @user253751 I do not think it answers the question as I still believe there would be a way to rewrite that code without an asm statement such that the compiler will generate an optimized function. – Something Something Aug 12 '22 at 00:37
  • @PeterCordes it would be up to the precondition. Every C/C++ function has a set of preconditions that would crash it if violated. – Something Something Aug 12 '22 at 00:40
  • 1
    Ok, then that makes your inline asm function safe to use *in your program*. But clearly, even if GCC were smarter, it couldn't optimize a stand-alone version of this function that way, because the pure C version has well-defined behaviour for some inputs where a single div would fault. It *could* after inlining, if the pre-condition were something it could prove at compile time. But unfortunately even with a guarantee like that, it won't do the optimization (unless something has changed.) – Peter Cordes Aug 12 '22 at 00:44

0 Answers0