
I have two functions written in assembly (MASM) in Visual Studio that I use in my C++ project: an unsigned 64-bit multiply that produces a 128-bit result, and an unsigned 128-bit divide that produces a 128-bit quotient and returns a 32-bit remainder.

What I need are signed versions of these functions, but I'm not sure how to write them.

Below is the .asm file with the unsigned functions:

.MODEL flat, stdcall
.CODE

MUL64 PROC, A:QWORD, B:QWORD, pu128:DWORD
push EAX
push EDX
push EBX
push ECX
push EDI
mov EDI,pu128
; LO(A) * LO(B)
mov EAX,DWORD PTR A
mov EDX,DWORD PTR B
MUL EDX
mov [EDI],EAX ; Save the partial product.
mov ECX,EDX
; LO(A) * HI(B)
mov EAX,DWORD PTR A
mov EDX,DWORD PTR B+4
MUL EDX
ADD EAX,ECX
ADC EDX,0
mov EBX,EAX
mov ECX,EDX
; HI(A) * LO(B)
mov EAX,DWORD PTR A+4
mov EDX,DWORD PTR B
MUL EDX
ADD EAX,EBX
ADC ECX,EDX
PUSHFD ; Save carry.
mov [EDI+4],EAX ; Save the partial product.
; HI(A) * HI(B)
mov EAX,DWORD PTR A+4
mov EDX,DWORD PTR B+4
MUL EDX
POPFD ; Retrieve carry from above.
ADC EAX,ECX
ADC EDX,0
mov [EDI+8],EAX ; Save the partial product.
mov [EDI+12],EDX ; Save the partial product.
pop EDI
pop ECX
pop EBX
pop EDX
pop EAX
ret 20
MUL64 ENDP

IMUL64 PROC, A:SQWORD, B:SQWORD, pi128:DWORD
; How to make this work?
ret 20
IMUL64 ENDP

DIV128 PROC, pDividend128:DWORD, Divisor:DWORD, pQuotient128:DWORD
push EDX
push EBX
push ESI
push EDI
MOV ESI,pDividend128
MOV EDI,pQuotient128
MOV EBX,Divisor
XOR EDX,EDX
MOV EAX,[ESI+12]
DIV EBX
MOV [EDI+12],EAX
MOV EAX,[ESI+8]
DIV EBX
MOV [EDI+8],EAX
MOV EAX,[ESI+4]
DIV EBX
MOV [EDI+4],EAX
MOV EAX,[ESI]
DIV EBX
MOV [EDI],EAX
MOV EAX,EDX
pop EDI
pop ESI
pop EBX
pop EDX
ret 12
DIV128 ENDP

IDIV128 PROC, pDividend128:DWORD, Divisor:DWORD, pQuotient128:DWORD
; How to make this work?
ret 12
IDIV128 ENDP

END

If you found this helpful in any way, please help the project by contributing the signed versions of the functions.

b.sullender
  • it's easy to [get the high part of the unsigned product from a signed multiplication and vice versa](https://stackoverflow.com/a/28827013/995714). But it'll be easier and better to use [`__int128` in other compilers](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html) – phuclv Feb 07 '18 at 17:18
  • `pushfd` / `popfd` is *slow*. And you don't need to save/restore EAX/ECX/EDX. MSVC has intrinsics for 32x32 => 64-bit multiply, so you could write this in C++, although this version https://stackoverflow.com/questions/46870373/umul128-on-windows-32-bits which is as efficient as I could make it still doesn't compile to good asm. Fixing this to save carry with `setc bl` into a register might work better. – Peter Cordes Feb 07 '18 at 18:24
  • This is really an application of math. Remember in school how you were taught to divide signed numbers? Do that here. – Raymond Chen Feb 07 '18 at 18:54
  • @PeterCordes OP could also use `lahf` or `salc` instead. – fuz Feb 07 '18 at 20:03
  • @fuz: EAX is needed as an implicit operand for one-operand `mul`, so that would be less efficient in this case, I think. – Peter Cordes Feb 07 '18 at 21:58
  • `my128.Hi -= (((A < 0) ? B : 0) + ((B < 0) ? A : 0));` Formula works. Any ideas how to make the 128-Bit divide signed? – b.sullender Feb 08 '18 at 02:40
  • @AlwaysNub Use `idiv` instead of `div`? – fuz Feb 08 '18 at 07:18
  • @PeterCordes `setc bl` may be faster than `pushfd`, but what instruction would you use to restore the carry flag or add it to the register? If I just add it to the register then I will first need to zero-extend `bl` with `movzx`, or I could set `ebx` to zero before saving using `xor`. Which do you think is faster than `popfd`? Are you saying that stdcall always expects EAX/ECX/EDX to be destroyed, regardless of whether it returns a value? – b.sullender Feb 11 '18 at 21:10
  • @fuz that's not gonna work. If you change any or all of the `div`s to `idiv`, an overflow error happens. Raymond Chen was correct, the solution is actually very simple. I will post my formula and code when I'm done writing the code. – b.sullender Feb 11 '18 at 21:12
  • @AlwaysNub: `popfd` is 9 uops, with one per 20c throughput on Skylake! (http://agner.org/optimize/). You can turn `bl` back into CF with `add bl, 0xFF` (1 uop, 1c latency) to set up for `adc`. Or if you don't need the carry-out result from a single `adc`, you can do it in 2 steps like you say by zero-extending the saved CF. An xor-zeroed `ebx` before the `add` that sets CF is usually best: only the `add` is on the critical path. Or if you can't spare a register, then `setc bl` / `movzx ebx, bl` is maybe still better than 2-uop `adc` on pre-Skylake. https://stackoverflow.com/a/33668295/ – Peter Cordes Feb 11 '18 at 23:10
  • And yes, those registers are call-clobbered in stdcall and other calling conventions, regardless of the return type. You can look at compiler-generated code for simple functions on http://godbolt.org/ to see that the compiler in practice does clobber those registers. – Peter Cordes Feb 11 '18 at 23:12

1 Answer


First, the MUL64 function is not 100% correct.

If you try to compute 0xFFFFFFFFFFFFFFFF x 0xFFFFFFFFFFFFFFFF, the high 64 bits of the result come out as 0xFFFFFFFEFFFFFFFF when they should be 0xFFFFFFFFFFFFFFFE.

To fix this, the carry flag restored by the POPFD instruction must also be propagated into EDX, the highest 32-bit part of the result. Next, following Peter Cordes' advice, remove the pushes and pops of EAX/ECX/EDX; those registers are call-clobbered under stdcall, so saving them is unnecessary. Finally, use setc BL and movzx EBX,BL to save the carry flag. Note: you cannot simply zero EBX with xor EBX,EBX before the setc, because xor affects the flags. On pre-Skylake CPUs, adding the zero-extended carry with a plain add is also cheaper than a 2-uop adc (on Skylake it is break-even).

The Result:

MUL64 PROC, A:QWORD, B:QWORD, pu128:DWORD
push EBX
push EDI
mov EDI,pu128
; LO(A) * LO(B)
mov EAX,DWORD PTR A
mov EDX,DWORD PTR B
mul EDX
mov [EDI],EAX ; Save the partial product.
mov ECX,EDX
; LO(A) * HI(B)
mov EAX,DWORD PTR A
mov EDX,DWORD PTR B+4
mul EDX
add EAX,ECX
adc EDX,0
mov EBX,EAX
mov ECX,EDX
; HI(A) * LO(B)
mov EAX,DWORD PTR A+4
mov EDX,DWORD PTR B
mul EDX
add EAX,EBX
adc ECX,EDX
setc BL ; Save carry.
movzx EBX,BL ; Zero-Extend carry.
mov [EDI+4],EAX ; Save the partial product.
; HI(A) * HI(B)
mov EAX,DWORD PTR A+4
mov EDX,DWORD PTR B+4
mul EDX
add EDX,EBX ; Add carry from above.
add EAX,ECX
adc EDX,0
mov [EDI+8],EAX ; Save the partial product.
mov [EDI+12],EDX ; Save the partial product.
pop EDI
pop EBX
ret 20
MUL64 ENDP
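
As a cross-check, the same schoolbook multiply can be written in portable C++ (a sketch with names of my own choosing, not part of the project). Feeding it the 0xFFFFFFFFFFFFFFFF squared case reproduces the corrected high half:

```cpp
#include <cstdint>

// Portable mirror of the fixed MUL64: four 32x32 -> 64 partial products,
// with the carry out of the middle column propagated into the top half,
// exactly as the setc/movzx sequence does in the asm. Names are mine.
void mul64_ref(uint64_t a, uint64_t b, uint64_t out[2])
{
    uint64_t ll = (a & 0xFFFFFFFF) * (b & 0xFFFFFFFF); // LO(A)*LO(B)
    uint64_t lh = (a & 0xFFFFFFFF) * (b >> 32);        // LO(A)*HI(B)
    uint64_t hl = (a >> 32) * (b & 0xFFFFFFFF);        // HI(A)*LO(B)
    uint64_t hh = (a >> 32) * (b >> 32);               // HI(A)*HI(B)

    // Middle column: low halves of the cross products plus the carry
    // out of the low partial product.
    uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFF) + (hl & 0xFFFFFFFF);
    out[0] = (mid << 32) | (ll & 0xFFFFFFFF);          // low 64 bits
    // mid >> 32 is the carry the original PUSHFD/POPFD version dropped.
    out[1] = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```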

Now, to make a signed version of the function, apply this correction to the high 64 bits of the unsigned product:

my128.Hi -= (((A < 0) ? B : 0) + ((B < 0) ? A : 0));

The Result:

IMUL64 PROC, A:SQWORD, B:SQWORD, pi128:DWORD
push EBX
push EDI
mov EDI,pi128
; LO(A) * LO(B)
mov EAX,DWORD PTR A
mov EDX,DWORD PTR B
mul EDX
mov [EDI],EAX ; Save the partial product.
mov ECX,EDX
; LO(A) * HI(B)
mov EAX,DWORD PTR A
mov EDX,DWORD PTR B+4
mul EDX
add EAX,ECX
adc EDX,0
mov EBX,EAX
mov ECX,EDX
; HI(A) * LO(B)
mov EAX,DWORD PTR A+4
mov EDX,DWORD PTR B
mul EDX
add EAX,EBX
adc ECX,EDX
setc BL ; Save carry.
movzx EBX,BL ; Zero-Extend carry.
mov [EDI+4],EAX ; Save the partial product.
; HI(A) * HI(B)
mov EAX,DWORD PTR A+4
mov EDX,DWORD PTR B+4
mul EDX
add EDX,EBX ; Add carry from above.
add EAX,ECX
adc EDX,0
mov [EDI+8],EAX ; Save the partial product.
mov [EDI+12],EDX ; Save the partial product.
; Signed version only:
cmp DWORD PTR A+4,0
jg zero_b
jl use_b
cmp DWORD PTR A,0
jae zero_b
use_b:
mov ECX,DWORD PTR B
mov EBX,DWORD PTR B+4
jmp test_b
zero_b:
xor ECX,ECX
mov EBX,ECX
test_b:
cmp DWORD PTR B+4,0
jg zero_a
jl use_a
cmp DWORD PTR B,0
jae zero_a
use_a:
mov EAX,DWORD PTR A
mov EDX,DWORD PTR A+4
jmp do_last_op
zero_a:
xor EAX,EAX
mov EDX,EAX
do_last_op:
add EAX,ECX
adc EDX,EBX
sub [EDI+8],EAX
sbb [EDI+12],EDX
; End of signed version!
pop EDI
pop EBX
ret 20
IMUL64 ENDP
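
The correction formula can be sanity-checked at half width in portable C++: build a signed 32x32 -> 64 product from an unsigned multiply plus the same Hi adjustment, and compare against a native int64_t multiply (a scaled-down illustration; the names are my own):

```cpp
#include <cstdint>

// Signed 32x32 -> 64 product from an unsigned multiply plus the
// Hi -= (A<0 ? B : 0) + (B<0 ? A : 0) correction -- the half-width
// analogue of what IMUL64 does after the unsigned 64x64 -> 128 multiply.
int64_t imul_from_unsigned(int32_t a, int32_t b)
{
    uint64_t p  = (uint64_t)(uint32_t)a * (uint32_t)b;  // unsigned product
    uint32_t hi = (uint32_t)(p >> 32);
    hi -= (a < 0 ? (uint32_t)b : 0u) + (b < 0 ? (uint32_t)a : 0u);
    return (int64_t)(((uint64_t)hi << 32) | (uint32_t)p);
}
```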

The DIV128 function should be fine (and is probably the fastest option) for getting a 128-bit quotient with a 32-bit divisor. If you need a 128-bit divisor, look at this code https://www.codeproject.com/Tips/785014/UInt-Division-Modulus, which shows the binary shift algorithm for 128-bit division. It could probably be around 3x faster if written in assembly.
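
For reference, the binary shift (shift-and-subtract) approach behind that link looks roughly like this in C++ (a minimal sketch using a hypothetical pair-of-uint64_t type, not the project's DQWORD; one compare/subtract per bit, so 128 iterations worst case):

```cpp
#include <cstdint>

struct u128 { uint64_t lo, hi; };  // hypothetical 128-bit value

static bool ge128(const u128 &a, const u128 &b)
{
    return a.hi != b.hi ? a.hi > b.hi : a.lo >= b.lo;
}

// Shift-and-subtract 128/128 division: bring down one dividend bit at a
// time into the remainder, subtracting the divisor whenever it fits.
void div128_ref(u128 n, u128 d, u128 *q, u128 *r)
{
    u128 rem = {0, 0};
    q->lo = q->hi = 0;
    for (int i = 127; i >= 0; --i) {
        // rem = (rem << 1) | bit i of n
        rem.hi = (rem.hi << 1) | (rem.lo >> 63);
        rem.lo = (rem.lo << 1) |
                 ((i >= 64 ? n.hi >> (i - 64) : n.lo >> i) & 1);
        if (ge128(rem, d)) {       // rem -= d, and set quotient bit i
            uint64_t borrow = rem.lo < d.lo;
            rem.lo -= d.lo;
            rem.hi -= d.hi + borrow;
            if (i >= 64) q->hi |= 1ULL << (i - 64);
            else         q->lo |= 1ULL << i;
        }
    }
    *r = rem;
}
```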

To make a signed version of DIV128, first determine whether the signs of the dividend and divisor are the same or different. If they are the same, the quotient is positive; if they differ, the quotient is negative. The remainder takes the sign of the dividend. So: negate the dividend and divisor if they are negative, call DIV128, and afterwards negate the quotient if the signs differed and negate the remainder if the dividend was negative.

Here is some example code written in C++:

VOID IDIV128(PSDQWORD Dividend, PSDQWORD Divisor, PSDQWORD Quotient, PSDQWORD Remainder)
{
    BOOL NegateQuotient, NegateRemainder;
    DQWORD DD, DV;

    // Use local DD and DV so Dividend and Divisor don't get corrupted.
    DD.Lo = Dividend->Lo;
    DD.Hi = Dividend->Hi;
    DV.Lo = Divisor->Lo;
    DV.Hi = Divisor->Hi;

    // The quotient is negative when the signs differ; the remainder
    // takes the sign of the dividend.
    NegateQuotient = ((DD.Hi ^ DV.Hi) & 0x8000000000000000) != 0;
    NegateRemainder = (DD.Hi & 0x8000000000000000) != 0;

    // Convert Dividend and Divisor to positive (negate) if negative.
    if (DD.Hi & 0x8000000000000000) NEG128((PSDQWORD)&DD);
    if (DV.Hi & 0x8000000000000000) NEG128((PSDQWORD)&DV);

    DIV128(&DD, &DV, (PDQWORD)Quotient, (PDQWORD)Remainder);

    if (NegateQuotient) NEG128(Quotient);
    if (NegateRemainder) NEG128(Remainder);
}
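
These sign rules (quotient negative exactly when the operand signs differ, remainder taking the dividend's sign) are the same ones C++'s native truncated division follows, which makes for an easy 64-bit sanity check of the strategy (illustrative sketch, my own names):

```cpp
#include <cstdint>

struct QR { int64_t q, r; };  // quotient / remainder pair

// 64-bit model of the signed-division strategy: divide the unsigned
// magnitudes, then negate the quotient when the signs differ and the
// remainder when the dividend is negative.
QR tdiv_ref(int64_t n, int64_t d)
{
    uint64_t un = n < 0 ? 0 - (uint64_t)n : (uint64_t)n;
    uint64_t ud = d < 0 ? 0 - (uint64_t)d : (uint64_t)d;
    uint64_t uq = un / ud, ur = un % ud;
    QR out;
    out.q = ((n < 0) != (d < 0)) ? -(int64_t)uq : (int64_t)uq;
    out.r = (n < 0) ? -(int64_t)ur : (int64_t)ur;
    return out;
}
```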

EDIT:

Following Peter Cordes' advice, we can optimize MUL64/IMUL64 even more. See the comments for the specific changes. I have also replaced MUL64 PROC, A:QWORD, B:QWORD, pu128:DWORD with a plain MUL64@20: label (and likewise IMUL64@20:) to eliminate the unnecessary EBP frame that MASM's PROC directive sets up, and I optimized the sign-fixing work for IMUL64.

The current .asm file for MUL64/IMUL64:

.MODEL flat, stdcall

EXTERNDEF  MUL64@20     :PROC
EXTERNDEF  IMUL64@20    :PROC

.CODE

MUL64@20:
push EBX
push EDI

;            -----------------
;            |     pu128     |
;            |---------------|
;            |       B       |
;            |---------------|
;            |       A       |
;            |---------------|
;            |  ret address  |
;            |---------------|
;            |      EBX      |
;            |---------------|
;    ESP---->|      EDI      |
;            -----------------

A       TEXTEQU   <[ESP+12]>
B       TEXTEQU   <[ESP+20]>
pu128   TEXTEQU   <[ESP+28]>

mov EDI,pu128
; LO(A) * LO(B)
mov EAX,DWORD PTR A
mul DWORD PTR B
mov [EDI],EAX ; Save the partial product.
mov ECX,EDX
; LO(A) * HI(B)
mov EAX,DWORD PTR A
mul DWORD PTR B+4
add EAX,ECX
adc EDX,0
mov EBX,EAX
mov ECX,EDX
; HI(A) * LO(B)
mov EAX,DWORD PTR A+4
mul DWORD PTR B
add EAX,EBX
adc ECX,EDX
setc BL ; Save carry.
mov [EDI+4],EAX ; Save the partial product.
; HI(A) * HI(B)
mov EAX,DWORD PTR A+4
mul DWORD PTR B+4
add EAX,ECX
movzx ECX,BL ; Zero-Extend saved carry from above.
adc EDX,ECX
mov [EDI+8],EAX ; Save the partial product.
mov [EDI+12],EDX ; Save the partial product.
pop EDI
pop EBX
ret 20

IMUL64@20:
push EBX
push EDI

;            -----------------
;            |     pi128     |
;            |---------------|
;            |       B       |
;            |---------------|
;            |       A       |
;            |---------------|
;            |  ret address  |
;            |---------------|
;            |      EBX      |
;            |---------------|
;    ESP---->|      EDI      |
;            -----------------

A       TEXTEQU   <[ESP+12]>
B       TEXTEQU   <[ESP+20]>
pi128   TEXTEQU   <[ESP+28]>

mov EDI,pi128
; LO(A) * LO(B)
mov EAX,DWORD PTR A
mul DWORD PTR B
mov [EDI],EAX ; Save the partial product.
mov ECX,EDX
; LO(A) * HI(B)
mov EAX,DWORD PTR A
mul DWORD PTR B+4
add EAX,ECX
adc EDX,0
mov EBX,EAX
mov ECX,EDX
; HI(A) * LO(B)
mov EAX,DWORD PTR A+4
mul DWORD PTR B
add EAX,EBX
adc ECX,EDX
setc BL ; Save carry.
mov [EDI+4],EAX ; Save the partial product.
; HI(A) * HI(B)
mov EAX,DWORD PTR A+4
mul DWORD PTR B+4
add EAX,ECX
movzx ECX,BL ; Zero-Extend saved carry from above.
adc EDX,ECX
mov [EDI+8],EAX ; Save the partial product.
mov [EDI+12],EDX ; Save the partial product.
; Signed version only:
mov BL,BYTE PTR B+7
and BL,80H
jz zero_a
mov EAX,DWORD PTR A
mov EDX,DWORD PTR A+4
jmp test_a
zero_a:
xor EAX,EAX
mov EDX,EAX
test_a:
mov BL,BYTE PTR A+7
and BL,80H
jz do_last_op
add EAX,DWORD PTR B
adc EDX,DWORD PTR B+4
do_last_op:
sub [EDI+8],EAX
sbb [EDI+12],EDX
; End of signed version!
pop EDI
pop EBX
ret 20

END
b.sullender
  • On Skylake, it's break-even between using `add BL,0xFF` to restore CF and then use `adc` vs. what you're doing. It's only on Haswell and earlier where `adc` is more expensive. So your code is fine, but you describe the reason wrong. – Peter Cordes Feb 17 '18 at 07:22
  • There's a further optimization possible: `setc al` / `movzx ebx, al` allows [mov-elimination for the movzx on IvyBridge and later](https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all), because it's between different registers. And: instead of `add EDX,EBX ; carry from above.` / `add EAX,ECX` / `adc EDX,0`, use instead **`add EAX,ECX` / `adc EDX, EBX ; HI += CF + saved CF`** – Peter Cordes Feb 17 '18 at 07:25
  • Also, `mov edx, B` / `mul edx` would be better as `mul dword ptr B`. If you're not going to keep the value around in a register, there's no benefit to loading it separately. Intel/AMD CPUs can keep the memory operand micro-fused, so this saves a uop for the front-end. You aren't using `esi`, so you could save/restore it as well, but maybe not worth it vs. just loading the same data multiple times. And probably also not worth it to enable `xor ecx,ecx` before a mul / add / adc instead of `movzx` after, because I don't think the save/restore of CF is on the critical latency path. – Peter Cordes Feb 17 '18 at 08:01
  • Can you use `imul` for the `HI(A) * HI(B)` part, to save some of the sign-fixing work? Also, that part could probably benefit from keeping more input data in a register instead of using so many memory operands. So the signed version should maybe save/restore `esi` and `ebp` (if you're not already using `ebp` as a frame pointer). – Peter Cordes Feb 17 '18 at 08:11
  • @PeterCordes Awesome work! Look at the edit to my answer for my changes. I did some testing and I think imul could be used to help, but things might get messy. Instead I replaced the cmp's with `mov BL,BYTE PTR B+7` / `and BL,80H` / `jz zero_a` and removed 2 unnecessary mov's. – b.sullender Feb 19 '18 at 06:18
  • Yeah, `HI(A) * LO(B)` and the other cross term is signed * unsigned, and there's no instruction for that. So you can't avoid fixups altogether. I didn't really look at your fixup code, so IDK if there's more room for improvement. Depending on the use-case, branchless is worth considering, although many use-cases will branch the same way or with a simple pattern that depends on recent branches in calling code so jz will predict well. Beware of store-forwarding stalls though if you modify just a single byte in memory before the caller loads the result as dwords. – Peter Cordes Feb 19 '18 at 06:26