Problem in multiplying two 64-bit numbers

Question

I want to multiply two 64-bit unsigned numbers together and store the result in a 128-bit variable. I did it in the way given in the link https://www.plantation-productions.com/Webster/www.artofasm.com/Windows/HTML/AdvancedArithmetica2.html#1007619 And Michael Burr has mentioned it, in response to Kawarazu's question (How can I multiply two 64-bit numbers using x86 assembly language?). This is my code:

program exercise;
static  

      qw1:qword:=  956_3547_6850_2300_5452;//$84B8_8BF3_1E4E_3B0C 
      qw2:qword:=  956_3547_6850_2300_5452;//$84B8_8BF3_1E4E_3B0C
  //qw1:qword:=  18_446_744_073_709_551_615;//$FFFF_FFFF_FFFF_FFFF
  //qw2:qword:=  18_446_744_073_709_551_615;//$FFFF_FFFF_FFFF_FFFF

    prd:lword :=0;
  
#include("Stdlib.hhf");

begin exercise;
mov((type dword qw1[0]),eax);
mul((type dword qw2[0]),eax);
mov(eax,(type dword prd[0]));
mov(edx,ecx);

mov((type dword qw1[4]),eax);
mul((type dword qw2[0]),eax);
add(ecx,eax);
mov(eax,ebx);
adc(0,edx);
mov(edx,ecx);

mov((type dword qw1[0]),eax);
mul((type dword qw2[4]),eax);
add(ebx,eax);
mov(eax,(type dword prd[4]));
adc(ecx,edx);
mov(edx,ecx);

mov((type dword qw1[4]),eax);
mul((type dword qw2[4]),eax);
add(ecx,eax);
mov(eax,(type dword prd[8]));
adc(0,edx);
mov(edx,(type dword prd[12]));
stdout.put(prd, "     ",(type uns128 prd));

end exercise;

After running the program,This code answers accurately For the product of two smaller upper numbers
(in this case, the exact answer is : 44CED55C313E2724C1798E86D8EE8890) But for the lower two larger numbers, the answer is far from the exact value
(in this case, the exact answer is: FFFFFFFFFFFFFFFE0000000000000001 but But the program gets the value FFFFFFFEFFFFFFFE0000000000000001 that there is a difference in the 13th byte).
But when I change the last adc(0,edx); to adc(1,edx);, the problem is solved.

My question is, where does this number 1 come from?
What is my mistake?
How can I modify this program?

This issue has created a serious challenge for me and I am grateful to my friends who will guide me in this matter.

greybeard · Answer 1 · 2023-08-02T13:22:14.453

You skipped commenting your source code.

The addition of the second "middle sub-product" generates a carry you don't account for.
Generally, either addition may generate a carry: just process them.

The addition of the second "middle sub-product" doesn't generate a carry for two larger numbers (Hashem alizadeh )

Well, it should:
FFFF FFFF×FFFF FFFF = FFFF FFFE 0000 0001

       hi×hi               lo×lo  
FFFF FFFE 0000 0001 FFFF FFFE 0000 0001   lower half never needs to be touched
no carry→ FFFF FFFE 0000 0001             1st add of middle sub-product, say, hi×lo
_____________________________
FFFF FFFE FFFF FFFF FFFF FFFF
   carry→ FFFF FFFE 0000 0001             2nd add of middle sub-product, say, lo×hi
_____________________________
FFFF FFFF FFFF FFFE 0000 0000

Thank you very much for your beautiful and intuitive guidance. The issue is resolved for me. [https://stackoverflow.com/users/3789665/greybeard] — Hashem alizadeh, Aug 01 '23 at 21:47

Sep Roland · Answer 2 · 2023-08-01T22:16:13.297

3

@greybeard is right in that the addition of the second cross product produces a further carry. Study the similar (decimal) case of multiplying 99 * 99 producing 9801.

Multiplying FFFF FFFF by itself gives FFFF FFFE 0000 0001 (Similar to 9 * 9 = 81).
The necessary additions then produce:

                    FFFF FFFE 0000 0001               8 1
          FFFF FFFE 0000 0001                       8 1
-----------------------------------              --------
          FFFF FFFE FFFF FFFF 0000 0001             8 9 1
          FFFF FFFE 0000 0001                       8 1
-----------------------------------              --------
        1 FFFF FFFD 0000 0000 0000 0001  <carry>  1 7 0 1
FFFF FFFE 0000 0001                               8 1
-----------------------------------              --------
FFFF FFFF FFFF FFFE 0000 0000 0000 0001           9 8 0 1

You can try next solution:

mov((type dword qw1[0]),eax);
mul((type dword qw2[0]));
mov(eax,(type dword prd[0]));
mov(edx,EBX);
xor(ecx,ecx);
xor(esi,esi);

mov((type dword qw1[4]),eax);
mul((type dword qw2[0]));
add(eax,ebx);
adc(edx,ecx);

mov((type dword qw1[0]),eax);
mul((type dword qw2[4]));
add(ebx,eax);
mov(eax,(type dword prd[4]));
adc(edx,ecx);
adc(esi,esi);      <<< If further carry exists Then make ESI = 1

mov((type dword qw1[4]),eax);
mul((type dword qw2[4]));
add(ecx,eax);
adc(esi,edx);

mov(eax,(type dword prd[8]));
mov(edx,(type dword prd[12]));

edited Aug 01 '23 at 22:16

answered Aug 01 '23 at 17:45

Sep Roland

33,889
7
43
76

1

@Hashemalizadeh Could it be the error is rather in how the result is shown by `stdout.put(prd, " ",(type uns128 prd));` ? Also, I don't use HLA, but are you sure `adc(0,edx)` uses an actual zero 'immediate' number? You could try: `xor(esi,esi)` `adc(esi,edx)`. – Sep Roland Aug 01 '23 at 18:32
1

@Hashemalizadeh I have edited my answer with a correction that deals with the further carry. – Sep Roland Aug 01 '23 at 21:52
1

I'd suggest using another register, like `setc %dl` / `movzbl %dl, %ebx`. On many CPUs that's cheaper than `sbb`/`neg`, in latency and/or in back-end execution ports (mov-elim between different regs on Ivybridge and later, and Intel before Broadwell have 2-uop `sbb`). (The false dependency on the old value of EDX isn't a problem since EDX was already an input to the `adc` that generated this CF.) Even better to reduce latency on more CPUs would be to zero another reg ahead of this which you could `setc` into the bottom of. Or `mov $0, %ebx` / `setc %bl`. – Peter Cordes Aug 01 '23 at 22:00
`setc %sil` (1 uop) is cheaper than `adc %esi, %esi` on Intel from P6 to Haswell (2 uops), or `adc $0, %esi` (2 uops on P6-family, 1 uop on Sandybridge-family). If you're going to zero another register ahead of time, prefer `setc`. (Since you're in 32-bit mode, use ESI where you were previously using EBX or ECX, so you can `setc %bl` or `%cl`.) Another option entirely would be to `mov` the `mul` results so you can do the adding after the last two `mul`s (and avoid materializing a 0/1 integer, instead maybe an `adc $0, %reg` into the high), but that might cost more than 2 `mov`s and more regs – Peter Cordes Aug 01 '23 at 22:34
I had a look at copying the second cross-product results to EDI and ESI to defer adding until after the high x high `mul`. https://pastebin.com/z0LjefwA . 1 more total instruction, but I was able to replace some `adc reg,reg` with `adc $0, %reg`, and mov-elimination saves back-end uops in case that's ever the bottleneck. But worse for the front-end on Broadwell and later. And maybe worse on Pentium M and other P6 CPUs, but some tweaking to favour `setc` over `adc` could help with that on both CPUs. And this version costs an extra register which might lead to more loads/stores. – Peter Cordes Aug 01 '23 at 23:14
1

This makes it pretty clear how nice it is to have [BMI2 `mulx`](https://www.felixcloutier.com/x86/mulx) to multiply without touching FLAGS, and into your choice of destinations. Also how nice APX `setc r32` will be in 64-bit code (e.g. for 128 x 128 => 256-bit) for efficiently materializing a full-register 0 or 1. AMD64 could have changed that. Anyway, I think your current answer is good for teaching correctness; more complicated wouldn't be better for that. – Peter Cordes Aug 01 '23 at 23:18

Problem in multiplying two 64-bit numbers

2 Answers2