How does this 128 bit integer multiplication work in assembly (x86-64)?

Question

I'm reading Computer Systems: A Programmer's Perspective and the homework was to describe how this algorithm works.

C function:

void store_prod(__int128 *dest, int64_t x, int64_t y) {
    *dest = x * (__int128)y;
}

Assembly:

movq %rdx, %rax
cqto
movq  %rsi, %rcx
sarq  $63,  %rcx
imulq %rax, %rcx
imulq %rsi, %rdx
addq  %rdx, %rcx
mulq  %rsi
addq  %rcx, %rdx
movq  %rax, (%rdi)
movq  %rdx, 8(%rdi)
ret

I don't know why it performs: xh * yl + yh * xl = value which we add after unsigned multiplication

just a guess: shifting makes it 128 bits, since you get 64 bits at the beginning. 1 and -1 are im guessing the pos/neg of the number — Andriy Lysak, Nov 18 '15 at 20:02
Both of the operands to the multiplication must be of the same type. To that end, `x` is promoted to type `__int128`, because `y` is of this type after the cast, and the integer promotion rank of `__int128` is higher than that of `int64_t`. One of the conversions is done by `cqto`, but that only works on `rax`, so the other is converted by `sarq`. — EOF, Nov 18 '15 at 20:02
@EOF but why do we multiply the low order bits of y with 1 or -1 ? imulq %rax, %rcx - this instruction, after the right shift, does exactly that. Since the low order bits, don't contain any sign information, why do we do that ? — denis631, Nov 18 '15 at 20:05
You don't multiply with `1` or `-1`, you multiply with `0` or `-1`. The arithmetic right shift does exactly what the `cqto` does: sign-extend to a whole register (`%rcx` for the `sarq`, `%rdx` for `cqto`). — EOF, Nov 18 '15 at 20:09
Since `imul` already provides a 64x64->128 bit multiply, I don't see the point of this. You can still explain how it works, of course :) Probably the usual case of disabled optimization, otherwise the compiler is clever enough to use a single `imul`. — Jester, Nov 18 '15 at 20:21
@Jester: But you can't write `imul` in C. If you multiply two 64-bit integers, you just get a 64-bit result. You *could* drop down to inline-assembly though, and do the multiplication in a single instruction. — EOF, Nov 18 '15 at 20:25
@EOF no need, as I said, at least [gcc and clang is clever enough](http://goo.gl/fmFtgG) to turn that exact C code into a single `imul`. icc, for some reason, isn't. — Jester, Nov 18 '15 at 20:33
@EOF please edit my answer, so I'll know if I am right or wrong. Thanks in advance — denis631, Nov 21 '15 at 10:17
It sounds to me like you want to ask a more general question. Such as if your processor only can do 32*32 to 64 or 64*64 to 64 how to do 128-bit multiplication? That's a much more interesting question. You don't want to do `xh * yl + yh * xl`. You can do that but then you have to figure out overflow. There is a way to do the multiplication without worrying about overflow. — Z boson, Nov 24 '15 at 13:29
@Zboson yeah. I want a deeper understanding of what is happening under the hood. I know that this formula works, but why ? — denis631, Nov 24 '15 at 13:34
@denis631, well then you have to ask a new question. Your current question is typed to x86-64 which already has an instruction to do 64*64 to 128. — Z boson, Nov 24 '15 at 13:35
@denis631, you were right not to make another question. I did not understand your question. You could have written it better. It only made sense after reading your answer. I can answer your question now. Give me a sec. — Z boson, Nov 25 '15 at 08:26

Peter Cordes · Answer 1 · 2021-07-27T10:12:27.267

As always, compiler options matter. That source code with gcc -Og (optimize for debugging) produces very similar asm to your listing (the cast sign-extends both operands to 128-bit before doing a full 128x128 => 128-bit multiply). This is a naive implementation of exactly what the C standard says should happen (integer precedence rules for converting both operands to the same type).

If you're going to talk about compiler output, you should always say which version of which compiler with what options. Or just post a link to it on godbolt, like the one above. (Edit: oops, source and asm were from a book that didn't give that info. And if that's the global edition of CS:APP 3e, beware that the practice problems are filled with errors in the global edition.)

With gcc -O3 or -O2, GCC takes advantage of the fact that both operands are still really only 64bit, so a single imul is enough. (This still produces the same result for every input, and thus still implements the C logic, per the as-if rule. C doesn't have widening operations so you're forced to write the source in an "inefficient" way that depends on the compiler to transform it into efficient asm.)

The sar $63, %rcx is part of sign-extending rsi into rcx:rsi, just like cqto sign-extends rax into rdx:rax. It replaces every bit of RCX with a copy of the original sign bit.

Most of this answer was already given by other people in comments, but I don't think anyone else noticed that gcc -Og / -O1 gives almost exactly that asm output.

thanks for the answer. As I said, it's the homework written in the book, so I don't know which compiler was used and with which optimization level flags. — denis631, Nov 19 '15 at 08:26
@TomZych: thanks for the tidy up. Minor improvement, but definitely an improvement. :) — Peter Cordes, Nov 19 '15 at 09:20
Thanks for the answer. I upvoted. A minor, language-lawyer remark here: the upward conversion to 128 bits should **not** be called an *integer promotion*, to be precise. An integer promotion specifically refers to the conversion up to `int` or `unsigned int`. For reference, see the *Integer promotions* section in https://en.cppreference.com/w/c/language/conversion — aafulei, Jul 27 '21 at 09:44
@aafulei: thanks for catching that terminology mistake from this old answer before I knew better :) — Peter Cordes, Jul 27 '21 at 10:12

denis631 · Answer 2 · 2015-11-29T05:29:06.433

In order to understand why do we do this operations, try to interpret int128_t as: 2^64 * xh + xl

so if we want to multiply two int128_t integers, we will do the following:

x = 2^64 * xh + xl

y = 2^64 * yh + yl

so x * y = (2^128 * xh * yh) + (2^64 * xh * yl) + (2^64 * yh * xl) + (yl * xl)

And this is exactly, what the assembly code does:

yh = %rdx yl = %rax

xh = %rcx xl = %rsi

2^64 * xh * yl: is imulq %rax, %rcx 2^64 indicates, that we need to add this to the high order bits

2^64 * yh * xl: is imulq %rsi, %rdx 2^64 indicates, that we need to add this to the high order bits

2^128 * xh * yh: This operation is not needed, since 2^128 * xh * yh won't fit in 128 bit integer. It represents only sign bit information and may be ignored.

xl * yl: is mulq %rsi

I hope this clears things up!

score 2 · Accepted Answer · edited May 23 '17 at 11:50

What GCC is doing is using the property that signed multiplication can be done using the following formula.

(hi,lo) = unsigned(x*y)
hi -= ((x<0) ? y : 0)  + ((y<0) ? x : 0)

Despite the fact that there is no need to do this since in this case the x86-64 instruction set has a signed 64-bit*64-bit to 128-bit instruction (imul with one operand) this formula is useful in other cases. For example for implementing signed 128-bit multiplication with SSE2/AVX2/AVX512 or for implementing 256-bit multiplication when the instruction set only does 128-bit multiplication (such as with x86-64).

GCC implemented this formula a bit differently though. If we take the sign bit and extend it to the whole word, call this function sign_ext, then the function returns -1 or 0. Then what GCC did is:

hi += sign_ext(x)*y + sign_ext(y)*x

for example sign_ext(x)*y in pseudo-instructions for 64-bit words is

sarq  $63, x    ; sign_ext(x)
imulq   y, x    ; sign_ext(x)*y

So now you ask (or meant to ask):

Why is this formula true?

That's a good qeustion. I asked this same question as well and njuffa wrote

@Zboson: It follows directly from two's complement complement representation. E.g. 32-bit integers -n and -m are represented as unsigned numbers x=2**32-n, y=2**32-m. If you multiply those you have x*y = 2**64 - 2**32*n - 2**32*m + n*m. The middle terms indicate the necessary corrections to the upper half of the product. Working through a simple example using -1*-1 should prove very instructive.

How does this 128 bit integer multiplication work in assembly (x86-64)?

3 Answers3

Linked

Related