5

I have two 128 bit numbers in memory in hexadecimal, for example (little endian):

x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

I've to perform the unsigned multiplication between these two numbers so my new number will be:

z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Now, I'm aware that I can move the half x and y number into rax and rbx registers and, for example, do the mul operation, and do the same with the other half. The problem is that by doing so I lose the carry-over and I've no idea how I can avoid that. It's about 4 hours I'm facing this problem and the only solution that can I see is the conversion in binary (and <-> shl,1).

Can you give me some input about this problem?
I think the best solution is to take one byte par time.

zx485
  • 28,498
  • 28
  • 50
  • 59
Davide
  • 1,931
  • 2
  • 19
  • 39
  • x86 provides a 64 × 64 → 128 multiplication with the result in rdx:rax. You can use that operation to get the desired result. If you are interested, I can give details. Please tell me whether you want signed or unsigned multiplication. – fuz Nov 24 '16 at 22:58
  • Gladly. The multiplication has to be unsigned. But I've to do a 128 x 128 -> 128 multiplication. If it goes in overflow I've to keep only the lowest part (it's normal) and set the OF flag. Let me know – Davide Nov 24 '16 at 23:02
  • Simply: The max number is (2^128)-1 – Davide Nov 24 '16 at 23:39
  • @Davide Are you satisfied with my answer? Is there any additional information you need? – fuz Nov 24 '16 at 23:40
  • I'm trying to understand it. If I've question on the matter I will ask you again. Many thanks :-) – Davide Nov 24 '16 at 23:41
  • Does [Karatsuba algo](https://en.wikipedia.org/wiki/Karatsuba_algorithm) serves your purpose? – Adrian Colomitchi Nov 25 '16 at 00:17
  • 2
    @AdrianColomitchi the numbers here are too short for that to help, the simple way to do it is already only 3 multiplications – harold Nov 25 '16 at 00:30
  • @harold: his numbers are short enough to do a 16x16 = 32 bit multiplication, but I think that's just an example – Tommylee2k Nov 25 '16 at 08:46
  • @AdrianColomitchi Indeed, Karatsuba would be useful if OP cared about the high 128 bit of the result, too. – fuz Nov 25 '16 at 13:22

2 Answers2

9

Let μ = 264, then we can decompose your 128 bit numbers a and b into a = a1μ + a2 and b = b1μ + b2. Then we can compute c = ab with 64 · 64 → 128 bit multiplications by first computing partial products:

q1μ + q2 = a2b2
r1μ + r2 = a1b2
s1μ + s2 = a2b1
t1μ + t2 = a1b1

and then accumulating them into a 256 bit result (watch the overflow when doing the additions!):

c = t1μ3 + (t2 + s1 + r1) μ2 + (s2 + r2 + q1) μ + q2

fuz
  • 88,405
  • 25
  • 200
  • 352
  • 5
    Just an observation: might be easier for OP to relate better if you used his `x`, `y`, `z` – Michael Petch Nov 25 '16 at 01:55
  • If you only need the low 128 of the result, you don't need some of the terms, and don't need to accumulate the full 256b result in the first place. Carry propagates strictly right-to-left, so nothing that happens to bits above the 128 you want can affect the result. (I'm sure you know this, but worth pointing out since [not everyone does](http://stackoverflow.com/questions/40751662/is-a-b-255-255-the-same-as-a-b-255)). Thus, the OP won't even need to compute the a1b1 multiply of high halves. – Peter Cordes Nov 25 '16 at 04:25
8

As usual, ask a compiler how to do something efficiently: GNU C on 64-bit platforms supports __int128_t and __uint128_t.

__uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; }

compiles to (gcc6.2 -O3 on Godbolt)

   imul    rsi, rdx        # a_hi * b_lo
   mov     rax, rdi
   imul    rcx, rdi        # b_hi * a_lo
   mul     rdx             # a_lo * b_lo  widening multiply
   add     rcx, rsi        # add the cross products ...
   add     rdx, rcx        # ... into the high 64 bits.
   ret

Since this is targeting the x86-64 System V calling convention, a is in RSI:RDI, while b is in RCX:RDX. The result is returned in RDX:RAX.

Pretty nifty that it only takes one MOV instruction, since gcc doesn't need the high-half result of a_upper * b_lower or vice versa. It can destroy the high halves of the inputs with the faster 2-operand form of IMUL since they're only used once.

With -march=haswell to enable BMI2, gcc uses MULX to avoid even the one MOV.


Sometimes compiler output isn't perfect, but very often the general strategy is a good starting point for optimizing by hand.


Of course, if what you really wanted in the first place was 128-bit multiplies in C, just use the compiler's built-in support for it. That lets the optimizer do its job, often giving better results than if you'd written a couple parts in inline-asm. (https://gcc.gnu.org/wiki/DontUseInlineAsm).

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • With this example, rdx contains the half 128 bit numbers and rdi the other half? – Davide Nov 25 '16 at 23:47
  • @Davide: highlighted the return-result location for you. The 128b result has its high half in RDX, low half in RAX. Notice that the low 64 bits of the result only depends on the low 64 bits of the two inputs. (Produced with MUL). – Peter Cordes Nov 26 '16 at 04:21
  • Doesn't works for me. I want to multiply 596a*18c2. Then in rsi I move 59h and in rdi 6ah. In rcx I move 18h and in rdx 0c2h. The result is in rdx 0x4d62 and in rax 0x5054, that doesn't correspond to the real multiplication, Result has to be 0x8a5b254 – Davide Nov 28 '16 at 19:51
  • @Davide: **You don't split the inputs in half, you split them on a 64-bit boundary.** You multiplied `0x000000590000006a * 0x00000018000000c2`, and then apparently looked at the low 64 bits of the result. The result is `0x00004d6200005054`, which this code returns as `RDX = 00004d62`, and `RAX = 0x00005054`, which is what you got. – Peter Cordes Nov 29 '16 at 00:55
  • Thanks Peter, sorry, but maybe you haven't understand. My result must be 0x0000000008a5b254 with that multiplication. With RDX = 00004d62 and RAX = 0x00005054 I can't do nothing – Davide Nov 29 '16 at 01:05
  • 1
    @Davide: apparently you don't understand. You need to set `rsi=0`:`rdi=0x596a` and `rcx=0`:`rdx=0x18c2` to do pass the inputs you described, and get the result you want. Also, if you know the high halves are zero, just use a single MUL instruction to get a 128b result. – Peter Cordes Nov 29 '16 at 01:12