How to divide two unsigned long 64-bit values in x86 assembly and return the quotient and remainder to a C program

Question

From a separate C program, I pass four parameters to an x86-64 assembly routine:

dividend
divisor
quotient pointer
remainder pointer

dividend = 0xA
divisor = 0x3

That is, 10 / 3, so the quotient should be 3 and the remainder should be 1.

However, my quotient comes back as c2 and my remainder as 7ffff396f687, both of which are far from what I should be getting. I've tried debugging my ASM code, but I can't figure out what the problem is.

This is what I have so far. I'm a beginner at this.

global divide64u
divide64u:
push rbp
mov rbp, rsp
mov rdx, rdi
mov rax, rsi
xor rdx, rdx 
div r10
divide64uDone:
pop rbp
ret

Why are you dividing by `r10`? I am not aware of any calling convention where the third argument is in `r10`. — fuz, May 01 '22 at 21:06
Are you trying to do 128-bit / 64-bit division (because you know that the quotient will fit in a uint64_t but the compiler doesn't)? Is that why you're using asm in the first place, instead of just looking at compiler output for `uint64_t` division? If you're not using a 128-bit dividend, you should be zeroing RDX, not copying an arg into it. — Peter Cordes, May 02 '22 at 04:40
I've just figured it out!!! Thanks, to all the kind people willing to help me out! — Boar, May 02 '22 at 17:55

Craig Estey · Answer 1 · 2022-05-02T19:51:57.303

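On x86-64, all four of these arguments are passed in registers, per the ABI. With the System V calling convention they arrive in RDI, RSI, RDX and RCX, in that order.

So there is no need to set up a stack frame with push rbp / mov rbp, rsp / pop rbp.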
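As a sketch of what that means for the routine in the question: assuming the System V AMD64 calling convention on an ELF target (e.g. Linux), the dividend is in RDI, the divisor in RSI, the quotient pointer in RDX and the remainder pointer in RCX. Because `div` uses RDX:RAX as its dividend, the quotient pointer has to be copied out of RDX before RDX is zeroed. The version below is one possible corrected `divide64u`, written in AT&T syntax as top-level asm inside a C file so it sits next to its prototype. The section directives and the portable fallback branch are illustrative additions, not part of the question's code.

```c
#include <stdint.h>

void divide64u(uint64_t dividend, uint64_t divisor,
               uint64_t *quot, uint64_t *rem);

#if defined(__x86_64__) && defined(__ELF__)
/* Assumes System V AMD64: RDI=dividend, RSI=divisor, RDX=quot, RCX=rem. */
__asm__(
    ".pushsection .text\n"
    ".globl divide64u\n"
    ".type divide64u, @function\n"
    "divide64u:\n"
    "    mov %rdx, %r8\n"      /* save quotient pointer; RDX is needed by div */
    "    mov %rdi, %rax\n"     /* low 64 bits of the dividend */
    "    xor %edx, %edx\n"     /* high 64 bits of the dividend = 0 */
    "    div %rsi\n"           /* RAX = quotient, RDX = remainder */
    "    mov %rax, (%r8)\n"    /* *quot = quotient */
    "    mov %rdx, (%rcx)\n"   /* *rem  = remainder */
    "    ret\n"
    ".size divide64u, .-divide64u\n"
    ".popsection\n"
);
#else
/* Fallback for other targets: plain C, same contract. */
void divide64u(uint64_t dividend, uint64_t divisor,
               uint64_t *quot, uint64_t *rem)
{
    *quot = dividend / divisor;
    *rem  = dividend % divisor;
}
#endif
```

Calling `divide64u(0xA, 0x3, &q, &r)` with two `uint64_t` variables `q` and `r` should leave 3 in `q` and 1 in `r`. As in the original, a divisor of zero is not handled and faults on x86.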

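You can actually code this in C (with or without inline asm) and let the optimizer do the work. At -O2, all of the versions below compile to essentially the same short sequence around a single div: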

typedef unsigned long long u64;

void
davdiv(u64 div,u64 dvr,u64 *quot,u64 *rem)
{

    __asm__ (
        "\tdiv %[dvr]\n"
    :   [quot] "=a" (*quot),
        [rem] "=d" (*rem)
    :   [div] "a" (div),
        [dvr] "r" (dvr),
        "d" (0));
}

void
mydiv(u64 div,u64 dvr,u64 *quot,u64 *rem)
{

    __asm__ __volatile__(
        "\txor  %%edx,%%edx\n"
        "\tmov  %[div],%%rax\n"
        "\tdiv  %[dvr]\n"
        "\tmov  %%rax,%[quot]\n"
        "\tmov  %%rdx,%[rem]\n"
    :   [quot] "=m" (*quot),
        [rem] "=m" (*rem)
    :   [div] "r" (div),
        [dvr] "r" (dvr)
    :   "rax", "rdx");
}

void
cpldiv(u64 div,u64 dvr,u64 *quot,u64 *rem)
{

    *quot = div / dvr;
    *rem = div % dvr;
}

u64
cplretA(u64 div,u64 dvr,u64 *rem)
{
    u64 quot;

    quot = div / dvr;
    *rem = div % dvr;

    return quot;
}

u64
cplretB(u64 div,u64 dvr,u64 *quot)
{
    u64 rem;

    *quot = div / dvr;
    rem = div % dvr;

    return rem;
}

Here is the disassembly of the above compiled with -O2:


div2.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <davdiv>:
   0:   49 89 d0                mov    %rdx,%r8
   3:   48 89 f8                mov    %rdi,%rax
   6:   31 d2                   xor    %edx,%edx
   8:   48 f7 f6                div    %rsi
   b:   49 89 00                mov    %rax,(%r8)
   e:   48 89 11                mov    %rdx,(%rcx)
  11:   c3                      retq
  12:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  19:   00 00 00 00
  1d:   0f 1f 00                nopl   (%rax)

0000000000000020 <mydiv>:
  20:   49 89 d0                mov    %rdx,%r8
  23:   31 d2                   xor    %edx,%edx
  25:   48 89 f8                mov    %rdi,%rax
  28:   48 f7 f6                div    %rsi
  2b:   49 89 00                mov    %rax,(%r8)
  2e:   48 89 11                mov    %rdx,(%rcx)
  31:   c3                      retq
  32:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  39:   00 00 00 00
  3d:   0f 1f 00                nopl   (%rax)

0000000000000040 <cpldiv>:
  40:   48 89 f8                mov    %rdi,%rax
  43:   48 89 d7                mov    %rdx,%rdi
  46:   31 d2                   xor    %edx,%edx
  48:   48 f7 f6                div    %rsi
  4b:   48 89 07                mov    %rax,(%rdi)
  4e:   48 89 11                mov    %rdx,(%rcx)
  51:   c3                      retq
  52:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  59:   00 00 00 00
  5d:   0f 1f 00                nopl   (%rax)

0000000000000060 <cplretA>:
  60:   48 89 d1                mov    %rdx,%rcx
  63:   48 89 f8                mov    %rdi,%rax
  66:   31 d2                   xor    %edx,%edx
  68:   48 f7 f6                div    %rsi
  6b:   48 89 11                mov    %rdx,(%rcx)
  6e:   c3                      retq
  6f:   90                      nop

0000000000000070 <cplretB>:
  70:   48 89 d1                mov    %rdx,%rcx
  73:   48 89 f8                mov    %rdi,%rax
  76:   31 d2                   xor    %edx,%edx
  78:   48 f7 f6                div    %rsi
  7b:   48 89 01                mov    %rax,(%rcx)
  7e:   48 89 d0                mov    %rdx,%rax
  81:   c3                      retq

How about: `__asm__ ("\tdiv %[dvr]\n" : [quot] "=a" (*quot), [rem] "=d" (*rem) : [div] "a" (div), [dvr] "r" (dvr), "d" (0));`? Let the compiler handle moving everything around for you. Also, isn't there some issue with overflow? — David Wohlferd, May 02 '22 at 01:00
@DavidWohlferd Nice. I've added that to the example code. But, I'm not sure about overflow. The asm code is the same even for `cpldiv` which is 100% C. The fact that compiler can combine `/` and `%` into a single operation is a common optimization. I've never seen code that checks any flags here (e.g. OF, CF, etc.) — Craig Estey, May 02 '22 at 19:55
Consider what happens if you take a 128 bit number and divide it by 2. The result is a 127 bit number and it just doesn't fit in a 64 bit register. I believe the processor faults (rather like using a bad pointer). — David Wohlferd, May 02 '22 at 20:30
@DavidWohlferd No, the processor doesn't fault--it just truncates the result. But we can _never_ have a 128 bit number. Only 64 (because of `u64 div`). That's why we have `xor %edx,%edx`. We're starting with 64 bit numbers. That is, `div` (dividend) is _zero_ extended to 128 bits before the `div` inst. — Craig Estey, May 02 '22 at 20:38
Hmm. The [docs](https://www.felixcloutier.com/x86/div) are telling me: *Overflow is indicated with the #DE (divide error) exception rather than with the CF flag.* — David Wohlferd, May 02 '22 at 22:10
@DavidWohlferd is correct; x86 does indeed raise `#DE` if the quotient doesn't fit in the operand-size (AL/AX/EAX/RAX). This is impossible for `div r/m64` *if* you use it with RDX=0, except for the special case of division by zero. But no, `div` itself doesn't zero-extend, you need to manually zero the high half of the dividend if you don't want to take advantage of the full power of 128-bit / 64-bit => 64-bit div, e.g. for N-chunk / 1-chunk extended precision, as explained in [Why should EDX be 0 before using the DIV instruction?](https://stackoverflow.com/q/38416593) — Peter Cordes, May 03 '22 at 03:33
Duplicates of that are frequent, with people naively using `div` without zeroing [er]DX or AH first. Or `idiv` without `cdq` or `cqo` to sign-extend RAX into RDX:RAX. `INT_MIN / -1` can actually overflow `idiv`'s quotient, though, even with the dividend being only single-width sign-extended: that kind of thing is why it's UB in C. See [Why does integer division by -1 (negative one) result in FPE?](https://stackoverflow.com/q/46378104) for details of the puzzle pieces involved on x86 vs. not faulting on ARM. — Peter Cordes, May 03 '22 at 03:39
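The overflow condition discussed in these comments can be checked before dividing. For an unsigned 128-bit / 64-bit division, the quotient fits in 64 bits exactly when the high half of the dividend is smaller than the divisor; that single comparison also rejects a zero divisor. The following is a minimal sketch of that guard, not code from the answer. The helper name `div128by64` is hypothetical, the inline asm assumes GCC or Clang on x86-64, and the fallback assumes the compiler provides `unsigned __int128`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: divide the 128-bit value hi:lo by d.
 * Returns false (and writes nothing) if `div` would raise #DE,
 * i.e. if d == 0 or the quotient would not fit in 64 bits. */
bool div128by64(uint64_t hi, uint64_t lo, uint64_t d,
                uint64_t *quot, uint64_t *rem)
{
    if (hi >= d)            /* also true for d == 0 */
        return false;

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* RDX:RAX / d  ->  RAX = quotient, RDX = remainder */
    __asm__("divq %[d]"
            : "+a"(lo), "+d"(hi)
            : [d] "r"(d)
            : "cc");
    *quot = lo;
    *rem  = hi;
#elif defined(__SIZEOF_INT128__)
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    *quot = (uint64_t)(n / d);
    *rem  = (uint64_t)(n % d);
#else
#error "this sketch needs x86-64 inline asm or unsigned __int128"
#endif
    return true;
}
```

With `hi` set to 0 this reduces to the ordinary 64-bit case from the question, where the only way to fail is a zero divisor.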
