2

The 8086/8087/8088 MACRO ASSEMBLY LANGUAGE REFERENCE MANUAL (c) 1980 Intel Corporation mentions (Emphasis mine):

... the 8087 provides a very good approximation of the real number system. It is important to remember, however, that it is not an exact representation, and that arithmetic on real numbers is inherently approximate.
Conversely, and equally important, the 8087 does perform exact arithmetic on its integer subset of the reals. That is, an operation on two integers returns an exact integral result, provided that the true result is an integer and is in range.

A more recent manual is even more concise (Emphasis theirs):

the IA processors ... They can process decimal numbers of up to 18 digits without round-off errors, performing exact arithmetic on integers as large as 2^64 (or 10^18).

The integer data types that the FPU supports include signed word (16 bit), signed dword (32 bit), and signed qword (64 bit). There is never any mention of UNsigned. In fact, everything about the FPU breathes signedness, to the extent that they even support signed zero (+0 and -0).
So, is it possible to use the FPU to divide a couple of unsigned 64-bit numbers and get an exact quotient and remainder for it?

For the division of a couple of signed 64-bit numbers I wrote next code. The quotient seems fine but the remainder is always returned zero. Why is this?

; IN (edx:eax,ecx:ebx) OUT (edx:eax,ecx:ebx,CF)
FiDiv:  push    edi ecx ebx edx eax
        mov     edi, esp
        fninit
        fild    qword [edi]     ; Dividend
        fild    qword [edi+8]   ; Divisor
        fld
        fnstcw  [edi]
        or      word [edi], 0C00h ; Truncate Towards Zero
        fldcw   [edi]
        fdivr   st2             ; st0 = st2 / st0
        fld                     ; Duplicate because `fistp` does pop
        fistp   qword [edi]     ; Quotient
        fmulp                   ; st1 *= st0, pop st0
        fsubp                   ; st1 -= st0, pop st0
        fistp   qword [edi+8]   ; Remainder
        fnstsw  ax
        test    ax, 101b        ; #Z (Divide-By-Zero), #I (Invalid Arithmetic)
        pop     eax edx ebx ecx edi
        jnz     .NOK
        ret                     ; CF=0
.NOK:   stc                     ; Overflow on 8000000000000000h / -1
        ret                     ; or Division by zero
; ------------------------------

The FPU rounding mode is set to 'Truncate Towards Zero' aka 'Chop', so as to mimic the behavior of the ALU idiv instruction.

Sep Roland
  • 33,889
  • 7
  • 43
  • 76

1 Answers1

5

Zero remainder

fdivr   st2             ; st0 = st2 / st0
fld                     ; Duplicate because `fistp` does pop
fistp   qword [edi]     ; Quotient
fmulp                   ; st1 *= st0, pop st0
fsubp                   ; st1 -= st0, pop st0
fistp   qword [edi+8]   ; Remainder

This code calculates the remainder from:

Remainder = Dividend - (Quotient * Divisor)

Because the Rounding Mode got set to 'Truncate Towards Zero', the fistp qword [edi] instruction will convert the (copy of the) quotient held in ST0 to an integer right before storing to memory. However the value for the quotient that remains on the (fpu) stack is still a real number with a fraction. Once multiplied with the divisor it will produce the dividend again, resulting in a zero remainder.
What is missing is rounding the quotient to an integer and doing it on the (fpu) stack already:

fdivr   st2             ; st0 = st2 / st0
frndint
fld                     ; Duplicate because `fistp` does pop
fistp   qword [edi]     ; Quotient
fmulp                   ; st1 *= st0, pop st0
fsubp                   ; st1 -= st0, pop st0
fistp   qword [edi+8]   ; Remainder

But a faster way is to simply reload the integer quotient from memory:

fdivr   st2             ; st0 = st2 / st0
fistp   qword [edi]     ; Quotient
fild    qword [edi]
fmulp                   ; st1 *= st0, pop st0
fsubp                   ; st1 -= st0, pop st0
fistp   qword [edi+8]   ; Remainder

Unsigned division

Internally, the FPU dedicates 64 bits to the significand of a number plus a separate bit for the sign of the number. The FPU can represent every whole number in the range from -18'446744073'709551616 to 18'446744073'709551616. The 64-bit significand allows us to work with the unsigned 64-bit integers that range from 0 to 18'446744073'709551615. The only trouble is how to load and store these values that fild and fistp can't deal with (because they are restricted to operate in the range from -9'223372036'854775808 to 9'223372036'854775807).
It would be possible to convert back and forth between unsigned quadword and the extended real format so we could use fld and fstp instead. Another approach would load/store the unsigned quadwords from/to their upper and lower halves. But conversions take time and so I found that by eliminating enough special cases, the only troublesome operation that remains was the loading of the dividend. Everything else can just use fild and fistp as usual.

The special cases include:

  • Division by zero: CF=1
  • Division by one: Quotient=Dividend and Remainder=0
  • Divisions where the Dividend is equal to the Divisor: Quotient=1 and Remainder=0
  • Divisions where the Dividend is smaller than the Divisor: Quotient=0 and Remainder=Dividend

Where an actual fdiv is needed, the code first loads half of the dividend, doubles it back on the (fpu) stack, and conditionally adds 1 if the true dividend was odd.

; IN (edx:eax,ecx:ebx) OUT (edx:eax,ecx:ebx,CF)
FuDiv:  cmp     ebx, 1
        jbe     .TST            ; Divisor could be 0 or 1
.a:     cmp     edx, ecx
        jb      .LT             ; Dividend < Divisor
        ja      .b              ; Dividend > Divisor
        cmp     eax, ebx
        jb      .LT             ; Dividend < Divisor
        je      .GE             ; Dividend = Divisor
.b:     test    ecx, ecx
        js      .GE             ; Dividend > Divisor > 7FFFFFFFFFFFFFFFh

        shr     edx, 1          ; Halving the unsigned 64-bit Dividend
        rcr     eax, 1          ; (CF)      to get in range for `fild`
        push    edi ecx ebx edx eax
        mov     edi, esp
        fninit
        fild    qword [edi]     ; st0 = int(Dividend / 2)
        fadd    st0             ; st0 = {Dividend - 1, Dividend}
        jnc     .c              ; (CF)
        fld1
        faddp                   ; st0 = Dividend [0, FFFFFFFFFFFFFFFFh]
.c:     fild    qword [edi+8]   ; Divisor is [2, 7FFFFFFFFFFFFFFFh]
        fld
        fnstcw  [edi]
        or      word [edi], 0C00h ; Truncate Towards Zero
        fldcw   [edi]
        fdivr   st2             ; st0 = st2 / st0
        fistp   qword [edi]     ; Quotient
        fild    qword [edi]
        fmulp                   ; st1 *= st0, pop st0
        fsubp                   ; st1 -= st0, pop st0
        fistp   qword [edi+8]   ; Remainder
        pop     eax edx ebx ecx edi
        ret                     ; CF=0

.TST:   test    ecx, ecx
        jnz     .a
        cmp     ebx, 1          ; Divisor is 0 or 1
        jb      .NOK
.ONE:   dec     ebx             ; Remainder ECX:EBX is 0
.NOK:   ret

.GE:    sub     eax, ebx        ; Remainder ECX:EBX is Dividend - Divisor
        sbb     edx, ecx
        mov     ebx, eax
        mov     ecx, edx
        mov     eax, 1          ; Quotient EDX:EAX is 1
        cdq
        ret                     ; CF=0

.LT:    mov     ebx, eax        ; Remainder ECX:EBX is Dividend
        mov     ecx, edx
        xor     eax, eax        ; Quotient EDX:EAX is 0
        cdq
        ret                     ; CF=0
; ------------------------------

Faster results

Quoting from an answer that uses a technique named Chunking to divide a couple of 64-bit integers:

Even if you are using a 64-bit data type, practice shows me that the majority of divisions in a (general purpose) program could still do with just using the built-in div instruction. And that is why I prefixed my code with a detection mechanism that checks whether the divisor in ECX:EBX is less than 4GB (so fitting EBX) and that the dividend's extension in EDX is less than the divisor in EBX. If these conditions are met, the normal div instruction does the job, and does it faster too. If for some reason (eg. school) using div is not allowed, then simply remove the prefixed code to be in the clear.

Today's code can benefit from the same prefix, but as it turns out, it is more beneficial to keep detecting the special cases first:

; IN (edx:eax,ecx:ebx) OUT (edx:eax,ecx:ebx,CF)
FuDiv:  cmp     ebx, 1
        jbe     .TST            ; Divisor could be 0 or 1
.a:     cmp     edx, ecx
        jb      .LT             ; Dividend < Divisor
        ja      .b              ; Dividend > Divisor
        cmp     eax, ebx
        jb      .LT             ; Dividend < Divisor
        je      .GE             ; Dividend = Divisor
.b:     test    ecx, ecx
        js      .GE             ; Dividend > Divisor > 7FFFFFFFFFFFFFFFh
; - - - - - - - - - - - - - - -
        jnz     .fdiv
        cmp     edx, ebx
        jnb     .fdiv
.div:   div     ebx             ; EDX:EAX / EBX --> EAX Quotient, EDX Remainder
        mov     ebx, edx        ; Remainder to ECX:EBX
        xor     edx, edx        ; Quotient to EDX:EAX
        ret                     ; CF=0
; - - - - - - - - - - - - - - -
.fdiv:  shr     edx, 1          ; Halving the unsigned 64-bit Dividend
        rcr     eax, 1          ; (CF)      to get in range for `fild`
        push    edi ecx ebx edx eax
        mov     edi, esp
        fninit
        fild    qword [edi]     ; st0 = int(Dividend / 2)
        fadd    st0             ; st0 = {Dividend - 1, Dividend}
        jnc     .c              ; (CF)
        fld1
        faddp                   ; st0 = Dividend [0, FFFFFFFFFFFFFFFFh]
.c:     fild    qword [edi+8]   ; Divisor is [2, 7FFFFFFFFFFFFFFFh]
        fld
        fnstcw  [edi]
        or      word [edi], 0C00h ; Truncate Towards Zero
        fldcw   [edi]
        fdivr   st2             ; st0 = st2 / st0
        fistp   qword [edi]     ; Quotient
        fild    qword [edi]
        fmulp                   ; st1 *= st0, pop st0
        fsubp                   ; st1 -= st0, pop st0
        fistp   qword [edi+8]   ; Remainder
        pop     eax edx ebx ecx edi
        ret                     ; CF=0

.TST:   test    ecx, ecx
        jnz     .a
        cmp     ebx, 1          ; Divisor is 0 or 1
        jb      .NOK
.ONE:   dec     ebx             ; Remainder ECX:EBX is 0
.NOK:   ret

.GE:    sub     eax, ebx        ; Remainder ECX:EBX is Dividend - Divisor
        sbb     edx, ecx
        mov     ebx, eax
        mov     ecx, edx
        mov     eax, 1          ; Quotient EDX:EAX is 1
        cdq
        ret                     ; CF=0

.LT:    mov     ebx, eax        ; Remainder ECX:EBX is Dividend
        mov     ecx, edx
        xor     eax, eax        ; Quotient EDX:EAX is 0
        cdq
        ret                     ; CF=0
; ------------------------------
Sep Roland
  • 33,889
  • 7
  • 43
  • 76
  • 1
    We can put less branching along the fast path (for common cases), like `cmp ebx, 1` / `jbe sort_out_0_or_1_divisor` where more branching happens off the fast path for that case? Similarly `cmp edx, ecx` / `jbe divisor_maybe_smaller` and jump back if necessary after further branching. Anyway, very dense branches are bad for branch prediction on *some* CPUs, and its more uops for the common case before we get to the common cases, like both being small numbers -> div or not -> fdiv. You'd like one of those to be reached without any *taken* branches, or as few as possible. – Peter Cordes Jun 11 '23 at 15:39
  • Also, fun trick with having `.GE:` do `call .LT`, but isn't call/inc just replacing `xor edx, edx` / `mov eax,1`? Oh, and copying 2 registers since you want the subtraction result in ecx,ebx. Still, `mov eax,1` / `cdq` is the same machine code size but faster than `call/inc`, so you're just saving 4 bytes of machine code. It might not even be that uncommon a path of execution; lots of inputs can lead to a quotient of 1. – Peter Cordes Jun 11 '23 at 15:44
  • 1
    @PeterCordes I successfully changed the code path for detecting divisors of 0 and 1. It had a small positive effect on the execution times for the other special cases and the ALU division, but not so much on the FPU division (that is 4x slower anyway). My efforts to change the code path for 'dividend LE divisor' so far didn't pay. – Sep Roland Jun 11 '23 at 19:35
  • 1
    I had a better idea than mov/cdq for returning 1. `xor edx,edx` / `lea eax, [edx+1]` is 5 bytes total, and one of them is xor-zeroing which is as cheap as a NOP on Intel sandybridge-family. But mov/cdq is more obvious to humans. For other changes like sorting out special cases inside `.LT` and `.GE`, it might be a tradeoff of making those rare cases slower vs. the common case a bit faster. But if the common case is the one involving fdiv that's super slow already, maybe there's some value to making those special cases fast, depending on what inputs are common. – Peter Cordes Jun 11 '23 at 20:35
  • Is it worth looking for the case where the divisor's low half is zero? Like `a / (b.hi * 2^32)` can reduce to `(a/2^32) / b.hi` = `a.hi / b.hi` for the quotient. The remainder is maybe just `(a.hi%b.hi)<<32 + a.lo`, so it's one `div` of the high parts and some register copying. If inputs are uniformly distributed, this will happen negligibly often so not worth branching for it, but some use-cases might have rounder divisors more often. – Peter Cordes Jun 11 '23 at 20:44
  • Again a matter of optimizing for the expected use-cases, and perhaps not over-doing it with code-size if making a generic function. libgcc's `__udivdi3` or `__udivmoddi4` certainly look big, but there's some repetition of `#if !UDIV_NEEDS_NORMALIZATION` or not. I don't see a check for `d0 == 0` in the `d1 != 0` case (https://github.com/lattera/glibc/blob/895ef79e04a953cac1493863bcae29ad85657ee1/sysdeps/wordsize-32/divdi3.c#L57) – Peter Cordes Jun 11 '23 at 20:54