10

How would I go about...

  • multiplying two 64-bit numbers

  • multiplying two 16-digit hexadecimal numbers

...using Assembly Language.

I'm only allowed to use registers %eax, %ebx, %ecx, %edx, and the stack.

EDIT: Oh, I'm using ATT Syntax on the x86
EDIT2: Not allowed to decompile into assembly...

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Kawarazu
  • 121
  • 1
  • 1
  • 8
  • You may want to specify *what* assembly you're using. General techniques are cross-applicable (usually), but the mnemonics are almost always different between platforms. :-) – Daniel Spiewak Sep 17 '08 at 21:20
  • Oh, ATT Syntax for the x86? I'm sorry for not adding that info... *Looks for edit button on title* – Kawarazu Sep 17 '08 at 21:35
  • 1
    Related: [multiply two 32-bit numbers to get a 64-bit number, on a 8086 (32x32 => 64-bit with 16-bit multiplies)](https://stackoverflow.com/q/13451628) shows the algorithm for a widening multiply. Same deal for 64x64 => 128-bit on a 32-bit machine. – Peter Cordes Jun 08 '21 at 08:13
  • And just for fun, [32-bit extended multiplication via stack](https://stackoverflow.com/a/67922154) includes a 64x64 => 128-bit multiply in 32-bit mode, using SSE2 `pmuludq` for 2 of the 4 partial products, and scalar `mul` for the other two. – Peter Cordes Jun 25 '21 at 04:42

11 Answers11

15

Use what should probably be your course textbook, Randall Hyde's "The Art of Assembly Language".

See 4.2.4 - Extended Precision Multiplication

Although an 8x8, 16x16, or 32x32 multiply is usually sufficient, there are times when you may want to multiply larger values together. You will use the x86 single operand MUL and IMUL instructions for extended precision multiplication ..

Probably the most important thing to remember when performing an extended precision multiplication is that you must also perform a multiple precision addition at the same time. Adding up all the partial products requires several additions that will produce the result. The following listing demonstrates the proper way to multiply two 64 bit values on a 32 bit processor ..

(See the link for full assembly listing and illustrations.)

Community
  • 1
  • 1
Michael Burr
  • 333,147
  • 50
  • 533
  • 760
4

If this was 64x86,

function(x, y, *lower, *higher)
movq %rx,%rax     #Store x into %rax
mulq %y           #multiplies %y to %rax
#mulq stores high and low values into rax and rdx.
movq %rax,(%r8)   #Move low into &lower
movq %rdx,(%r9)   #Move high answer into &higher
DopeDo
  • 49
  • 2
  • the question is about x86, not x86-64 which is obviously too easy since nothing special needs to be done – phuclv Mar 23 '19 at 17:11
2

This code assumes you want x86 (not x64 code), that you probably only want a 64 bit product, and that you don't care about overflow or signed numbers. (A signed version is similar).

MUL64_MEMORY:
     mov edi, val1high
     mov esi, val1low
     mov ecx, val2high
     mov ebx, val2low
MUL64_EDIESI_ECXEBX:
     mov eax, edi
     mul ebx
     xch eax, ebx  ; partial product top 32 bits
     mul esi
     xch esi, eax ; partial product lower 32 bits
     add ebx, edx
     mul ecx
     add ebx, eax  ; final upper 32 bits
; answer here in EBX:ESI

This doesn't honor the exact register constraints of OP, but the result fits entirely in the registers offered by the x86. (This code is untested, but I think it's right).

[Note: I transferred (my) this answer from another question that got closed, because NONE of the other "answers" here directly answered the question].

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • @PeterCordes: Are you sure? The "carry" from the lower 32 bits is the 32 bits in edx at the "add ebx edx" instruction. – Ira Baxter Mar 09 '18 at 00:12
  • Oh right, we're only doing 64x64 => 64 bits, and the low 32 bits of the result is fully determined by the low half of `lo1 * lo2` => 64 bits, with no addition that can carry into the high half result of `lo1 * hi2` => 32 bits and `lo2 * hi1` => 32 bits. So those last 2 could be done with 2-operand `imul ecx, esi` / `imul edi, ebx` instead of `xchg` and `mul`, because you only want a 64-bit result, not 96 or 128. – Peter Cordes Mar 09 '18 at 03:39
2

Since you're on x86 you need 4 mull instructions. Split the 64bit quantities into two 32bit words and multiply the low words to the lowest and 2nd lowest word of the result, then both pairs of low and high word from different numbers (they go to the 2nd and 3rd lowest word of the result) and finally both high words into the 2 highest words of the result. Add them all together not forgetting to deal with carry. You didn't specify the memory layout of the inputs and outputs so it's impossible to write sample code.

jjrv
  • 4,211
  • 2
  • 40
  • 54
  • 2
    I think the high*high part is unneeded as it overflows no matter what – BCS Sep 17 '08 at 21:33
  • @BCS: Correct, the high x high part is only needed if you want a 64x64 => 128-bit full multiply ([example with SSE2](https://stackoverflow.com/a/67922154)). The two halves of h x h line up with the top two chunks of the 128-bit full product, fully outside the input-width result. https://godbolt.org/z/GWT86f78c shows GCC `-O3 -m32` output for a non-widening `uint64_t` product on 32-bit x86, using two non-widening 32-bit imul for the cross products, and one widening `mul` for the low x low, with no adc needed. – Peter Cordes Jun 25 '21 at 04:47
0

It depends what language you are using. From what I remember from learning MIPS assembly, there is a Move From High command and a Move From Lo command, or mflo and mfhi. mfhi stores the top 64bits while mflo stores the lower 64bits of the total number.

Dave
  • 89
  • 1
  • 2
  • 11
0

ah assembly, been awhile since i've used it. so i'm assuming the real problem here is that the microcontroller (what i used to write code for in assembly anyways) you're working on doesn't have 64 bit registers? if that's the case, you're going to have the break the numbers you're working with apart and perform multiple multiplications with the pieces.

this sounds like it's a homework assignment from the way you've worded it, so i'm not gonna spell it out much further :P

0

Just do normal long multiplication, as if you were multiplying a pair of 2-digit numbers, except each "digit" is really a 32-bit integer. If you're multiplying two numbers at addresses X and Y and storing the result in Z, then what you want to do (in pseudocode) is:

Z[0..3] = X[0..3] * Y[0..3]
Z[4..7] = X[0..3] * Y[4..7] + X[4..7] * Y[0..3]

Note that we're discarding the upper 64 bits of the result (since a 64-bit number times a 64-bit number is a 128-bit number). Also note that this is assuming little-endian. Also, be careful about a signed versus an unsigned multiply.

Adam Rosenfield
  • 390,455
  • 97
  • 512
  • 589
  • 1
    That's missing the high bits from the first part – BCS Sep 17 '08 at 21:31
  • Wait, you've confused me-- You've said to go ahead and get rid of the upper 64 bits of the result? Why would that be... well, rational...? – Kawarazu Sep 17 '08 at 21:33
0

Find a C compiler that supports 64bit (GCC does IIRC) compile a program that does just that, then get the disassembly. GCC can spit it out on it's own and you can get it out of object file with the right tools.

OTOH their is a 32bX32b = 64b op on x86

a:b * c:d = e:f
// goes to
e:f = b*d;
x:y = a*d;  e += x;
x:y = b*c;  e += x;

everything else overflows

(untested)

Edit Unsigned only

BCS
  • 75,627
  • 68
  • 187
  • 294
-1

I'm betting you're a student, so see if you can make this work: Do it word by word, and use bit shifts. Think up the most efficient solution. Beware of the sign bit.

easeout
  • 8,665
  • 5
  • 43
  • 51
  • 1
    In the end, bit shifts won't be more efficient than the multiplication instructions - at least not on the x86 and not with 32-bit values. To learn the basics of what the processor does to perform a multiplication it's a good exercise though. – Olof Forshell Feb 09 '11 at 08:06
-3

If you want 128 mode try this...

__uint128_t AES::XMULTX(__uint128_t TA,__uint128_t TB)
{
    union
    {
        __uint128_t WHOLE;
        struct
        {
            unsigned long long int LWORDS[2];
        } SPLIT;
    } KEY;
    register unsigned long long int __XRBX,__XRCX,__XRSI,__XRDI;
    __uint128_t RESULT;

    KEY.WHOLE=TA;
    __XRSI=KEY.SPLIT.LWORDS[0];
    __XRDI=KEY.SPLIT.LWORDS[1];
    KEY.WHOLE=TB;
    __XRBX=KEY.SPLIT.LWORDS[0];
    __XRCX=KEY.SPLIT.LWORDS[1];
    __asm__ __volatile__(
                 "movq          %0,             %%rsi           \n\t"       
                 "movq          %1,             %%rdi           \n\t"
                 "movq          %2,             %%rbx           \n\t"
                 "movq          %3,             %%rcx           \n\t"
                 "movq          %%rdi,          %%rax           \n\t"
                 "mulq          %%rbx                           \n\t"
                 "xchgq         %%rbx,          %%rax           \n\t"
                 "mulq          %%rsi                           \n\t"
                 "xchgq         %%rax,          %%rsi           \n\t"
                 "addq          %%rdx,          %%rbx           \n\t"
                 "mulq          %%rcx                           \n\t"
                 "addq          %%rax,          %%rbx           \n\t"
                 "movq          %%rsi,          %0              \n\t"
                 "movq          %%rbx,          %1              \n\t"
                 : "=m" (__XRSI), "=m" (__XRBX)
                 : "m" (__XRSI),  "m" (__XRDI), "m" (__XRBX), "m" (__XRCX)
                 : "rax","rbx","rcx","rdx","rsi","rdi"
                 );
    KEY.SPLIT.LWORDS[0]=__XRSI;
    KEY.SPLIT.LWORDS[1]=__XRBX;
    RESULT=KEY.WHOLE;
    return RESULT;
}
Sven Schoenung
  • 30,224
  • 8
  • 65
  • 70
user80998
  • 48
  • 2
  • 1
    Why would you require the inputs and outputs to be in memory, instead of just letting gcc load them if necessary? You'd get better asm from just using `__uint128_t` and letting gcc do it. Also, this pretty much just copies Ira Baxter's answer. If it was *good* inline asm, it might be a useful addition, but it's not. Using a lot of fixed registers, and requiring memory operands is nowhere near optimal. You should just let gcc handle everything except the 64x64 -> 128 `mul` instructions, using two separate `asm` statements with 2 inputs and 2 outputs. And this shouldn't be `volatile` – Peter Cordes Mar 05 '16 at 10:54
  • Because I am interfacing C into Assembler and I don't have the source code for the libs that GCC use and I don't want to use displacements as that would screw up the answer that I am looking for and I also wanted to speed up a piece of code that raises n^p mod n OK. – user80998 Mar 06 '16 at 21:09
  • You should be using code like `asm ("mulq %[src]" : "=a"(lo64_result), "=d"(hi64_result) : "a"(KEY.SPLIT.LWORDS[1]), [src] "rm" (KEY.SPLIT.LWORDS[0]));` And then a second similar asm statement for the second multiply. gcc will do all the `mov` instructions. You just tell it where things need to be, and where the results will appear. Source for gcc library code has nothing to do with anything, and neither do memory displacements. See the [tag:x86] tag wiki for links about how to write inline asm that isn't terrible. That compiles: http://goo.gl/izSfMi – Peter Cordes Mar 07 '16 at 03:06
  • I actually extracted this from Borland Turbo C++ so I am not copying using debug86.exe – user80998 Aug 19 '16 at 16:30
-3

If you want 128bit multiplication then this should work this is in AT&T format.

__uint128_t FASTMUL128(const __uint128_t TA,const __uint128_t TB)
{
    union
    {
        __uint128_t WHOLE;
        struct
        {
            unsigned long long int LWORDS[2];
        } SPLIT;
    } KEY;
    register unsigned long long int __RAX,__RDX,__RSI,__RDI;
    __uint128_t RESULT;

KEY.WHOLE=TA;
__RAX=KEY.SPLIT.LWORDS[0];
__RDX=KEY.SPLIT.LWORDS[1];
KEY.WHOLE=TB;
__RSI=KEY.SPLIT.LWORDS[0];
__RDI=KEY.SPLIT.LWORDS[1];
__asm__ __volatile__(
    "movq           %0,                             %%rax                   \n\t"
    "movq           %1,                             %%rdx                   \n\t"
    "movq           %2,                             %%rsi                   \n\t"
    "movq           %3,                             %%rdi                   \n\t"
    "movq           %%rsi,                          %%rbx                   \n\t"
    "movq           %%rdi,                          %%rcx                   \n\t"
    "movq           %%rax,                          %%rsi                   \n\t"
    "movq           %%rdx,                          %%rdi                   \n\t"
    "xorq           %%rax,                          %%rax                   \n\t"
    "xorq           %%rdx,                          %%rdx                   \n\t"
    "movq           %%rdi,                          %%rax                   \n\t"
    "mulq           %%rbx                                                   \n\t"
    "xchgq          %%rbx,                          %%rax                   \n\t"
    "mulq           %%rsi                                                   \n\t"
    "xchgq          %%rax,                          %%rsi                   \n\t"
    "addq           %%rdx,                          %%rbx                   \n\t"
    "mulq           %%rcx                                                   \n\t"
    "addq           %%rax,                          %%rbx                   \n\t"
    "movq           %%rsi,                          %%rax                   \n\t"
    "movq           %%rbx,                          %%rdx                   \n\t"
    "movq           %%rax,                          %0                      \n\t"
    "movq           %%rdx,                          %1                      \n\t"
    "movq           %%rsi,                          %2                      \n\t"
    "movq           %%rdi,                          %3                      \n\t"
    : "=m"(__RAX),"=m"(__RDX),"=m"(__RSI),"=m"(__RDI)
    :  "m"(__RAX), "m"(__RDX), "m"(__RSI), "m"(__RDI)
    : "rax","rbx","ecx","rdx","rsi","rdi"
);
KEY.SPLIT.LWORDS[0]=__RAX;
KEY.SPLIT.LWORDS[1]=__RDX;
RESULT=KEY.WHOLE;
return RESULT;
}
SAm
  • 2,154
  • 28
  • 28
user80998
  • 48
  • 2
  • This is a repost of the same user's [inefficient and clunky answer on this question](https://stackoverflow.com/a/35813111/224132). See my comments there. Don't do that. Edit your first one if you have changes. – Peter Cordes Mar 08 '18 at 22:56