5

As somebody new to assembly, I use gcc for reverse engineering. But now I've run into a somewhat funny problem: I'm trying to multiply two 64-bit integers on x86-64. The C code looks as follows:

unsigned long long 
val(unsigned long long a, unsigned long long b){
    return a*b;
}

and gcc compiles it to:

val:
    movq    %rdi, %rax
    imulq   %rsi, %rax
    ret

It might be counterintuitive that signed multiplication (imul) is used for unsigned integers, but the low 64 bits of the product are the same either way, so it works for C.

However, I would like to check the multiplication for overflow. The overflow flag is set if the result is greater than 2^63-1 (I guess because it is a signed multiplication after all). But for unsigned 64-bit values this would still be OK as long as the result is not greater than 2^64-1.

What is the right way to do the multiplication (in assembly) in this case?

Peter Cordes
ead
  • Could you, perhaps, test, *after the fact,* to see if the result is "(unsigned ...) smaller" than it used to be? Just guessin' ... – Mike Robinson Jul 07 '16 at 19:25
  • Or, you could simply use `mul`, which sets CF and OF if the result of the unsigned multiplication overflows operand-size. – EOF Jul 07 '16 at 19:27
  • @EOF why does gcc not use `mul` for unsigned multiplication? – ead Jul 07 '16 at 19:35
  • @ead: Because `imul` does the same for `n*n -> n`-bit multiplications. Also, `mul` only has a one-operand form (it uses `[r/e]ax` implicitly) and always clobbers `[r/e]ax` and `[r/e]dx`. It is simply less flexible. – EOF Jul 07 '16 at 19:38
  • @MikeRobinson I don't think this would work. As an example, calculating mod 16: `6*7=42=10 mod 16`, with 10 greater than both 6 and 7. – ead Jul 07 '16 at 19:40
  • @MikeRobinson that won't work. For example in the unsigned 8-bit case where `3 * 129 = 387` that product is truncated to `131` which is greater than both the operands. – Weather Vane Jul 07 '16 at 19:43
  • I hereby *"as-gracefully as-possible ;-) **yield** to the now-revealed error of my ways ..."* Thank you, all. "You're quite correct. Me bad." – Mike Robinson Jul 07 '16 at 19:53
  • @MikeRobinson it was a good guess and I even tried that once. Better is a trial division from the max first, if you don't have access to processor flags. – Weather Vane Jul 07 '16 at 19:55
  • *@MikeRobinson bows gracefully again ...* – Mike Robinson Jul 07 '16 at 19:57
  • In C, unsigned multiplication does not overflow. If the result is outside the range of the unsigned type, it's adjusted by in effect discarding the high-order bits. If you want to check whether any such adjustment was necessary, then you're doing something other than C unsigned multiplication. (I absolutely do not mean to imply that there's anything wrong with that.) – Keith Thompson Jul 07 '16 at 20:15
  • @KeithThompson: "overflow" is getting used here to mean "wrap around", as in whether the discarded high half was non-zero. It's one of those words that has at least one technical meaning, but gets used for other related things that have different technical names. (e.g. carry out of unsigned addition) – Peter Cordes Jul 07 '16 at 20:21

2 Answers

9

It looks like you can't use imul without a bunch of extra code, since CF and OF are both set the same way. As the "operation" section of the manual describes, they're set if the full 128b result doesn't match sign_extend(low_half_result). So you're right, even the multi-operand forms of imul still have some signed behaviour. It would be nice if they were like add/sub and set OF and CF independently, so you could look at CF for unsigned data or OF for signed data.
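
To make that concrete, here's a small hand-written example (my own, in NASM-style syntax) where the unsigned product fits in 64 bits, but imul still reports overflow because the result doesn't fit as a *signed* 64-bit value:

    mov   rax, 0x4000000000000000   ; 2^62
    imul  rax, rax, 2               ; rax = 0x8000000000000000 (2^63)
                                    ; unsigned: exact, no wrap-around
                                    ; but OF = CF = 1, because +2^63 doesn't fit in a signed 64-bit integer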

One of the best ways to find a good asm sequence for something is to ask a compiler. C doesn't have convenient integer-overflow detection, but Rust does.

I compiled this function to return the value and the unsigned-wraparound detection bool. Apparently Rust's ABI returns them via a pointer passed as a hidden first arg, instead of in rdx:rax like I think the C ABI would for such a small struct. :(

pub fn overflowing_mul(a: u64, b: u64) -> (u64, bool) {
  a.overflowing_mul(b)
}
    # frame-pointer boilerplate elided
    mov     rax, rsi
    mul     rdx
    mov     qword ptr [rdi], rax
    seto    byte ptr [rdi + 8]

    mov     rax, rdi                # return the pointer to the return-value
    ret

Asm output from the Godbolt compiler explorer (Rust 1.7.0). This more or less confirms that the mov instruction and the extra uop for a one-operand full multiply are more efficient than anything we could do with extra checks after a two-operand imul.

The documentation for mul says

"The OF and CF flags are set to 0 if the upper half of the result is 0; otherwise, they are set to 1."

So in summary, use mul and check OF or CF to see if the high half is non-zero.
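
(Side note: if you'd rather express the check in C, gcc and clang provide __builtin_mul_overflow as an extension. The wrapper below is just a sketch of mine, assuming gcc 5+ or a recent clang; in my experience they compile it to essentially the same mul + seto/setc pattern.)

    #include <stdbool.h>

    // Returns true if a*b wrapped, i.e. the high half of the full 128-bit product was non-zero.
    // The truncated low 64 bits are stored in *result either way.
    static bool mul_wraps_u64(unsigned long long a, unsigned long long b,
                              unsigned long long *result)
    {
        return __builtin_mul_overflow(a, b, result);
    }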


mul vs. imul trivia:

Only the upper half of the full-multiply (N x N => 2N) result differs between imul and mul. I think Intel picked imul as the one that would have multiple explicit operands so that
imul r32, r32, sign-extended-imm8 would make more sense, because sign-extension is probably more useful than zero-extension.
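
For example (my own illustration): multiplying by a small negative constant can use the short imm8 encoding precisely because the immediate is sign-extended:

    imul  eax, ecx, -5     ; -5 fits in a sign-extended imm8 (the short 6B opcode)
                           ; with a zero-extended imm8, -5 would need the full imm32 form (opcode 69)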

I only just realised that the flag results from imul were signed-only, though. Interesting point.


why does gcc not use mul for unsigned multiplication?

Because one-operand mul/imul are slower (2 uops instead of 1 on Intel CPUs, according to Agner Fog's insn tables. See also the tag wiki). They also use more registers: They require one of their inputs in rax, and produce their outputs in rdx:rax, so extra mov instructions are usually required to move data in/out of those regs.

Thus, imul r64, r64 is a better choice than mul r64, if you don't care about the flag result.
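
A quick sketch of the register-pressure difference (my example, with the operands arbitrarily assumed to start in rdi and rsi):

    ; one-operand form: inputs/outputs are tied to rax/rdx
    mov   rax, rdi
    mul   rsi               ; rdx:rax = rdi * rsi, clobbers rdx

    ; two-operand form: any register can be the destination, no high half produced
    imul  rdi, rsi          ; rdi = low 64 bits of rdi * rsi, rdx left alone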

On Intel CPUs imul r64,r64 is actually faster than mul r32. That's not the case on some other CPUs, including AMD Bulldozer-family where 64bit multiplies are somewhat slower. But since mul r32 puts its results into edx:eax instead of just one destination register, they're not direct replacements for each other in most cases anyway.

Peter Cordes
  • I'm a little confused by your very last paragraph. How is `imul r64, r64` a better choice than `mul r32` on most CPUs, if it's slower on Atom and AMD? Am I reading that wrong? – Cody Gray - on strike Jul 07 '16 at 20:27
  • @CodyGray: `mul r32` has extra overhead to move the data into the right registers. This makes up for some of the speed difference on other CPUs. Also, `imul r64, r64` is faster on modern Intel, i.e. "most CPUs" that people are tuning for. I meant "most CPUs" as in average weighted by popularity, not "most kinds of CPUs". IIRC, clang tends to generate `imul r64,r64` even if you cast two 32bit variables to 64bit. – Peter Cordes Jul 07 '16 at 20:32
  • @CodyGray: although to be honest, compilers apparently behave that way because they're dumb, not necessarily because it's better. [Even with `-march=bdver2` (i.e. AMD piledriver), clang and gcc use `imul r64, r64`](https://godbolt.org/g/KjTyJz). Actually even with `-march=atom`, where it's completely horrible. Oh, I just realized that `mul r32` would leave the result split between two registers. Hmm, I'm not sure if there's still a useful point to be made. Maybe I should just take out that last paragraph. – Peter Cordes Jul 07 '16 at 20:37
  • I suspect compilers prefer `imul r64, r64` because it makes register allocation significantly easier. But that's also a good point about leaving the results split between two registers. There would undoubtedly be a penalty in merging those two results, and when there are 64-bit registers available, you might as well use them from the start. But that wouldn't matter in the case where you were multiplying two 32-bit values, because the C and C++ language standards permit the compiler to just throw away the upper 32-bits of the result. – Cody Gray - on strike Jul 07 '16 at 20:41
  • @CodyGray: `imul r32,r32` is always at least as fast as `mul r32` on all CPUs. This only comes up for compilers when they know the multiplicands are all zero in their upper 32 bits (e.g. after a cast). And yeah, merging takes a SHL + OR. I think I remember testing what compilers did, with some code where either was a valid option. Maybe storing the results to a struct, where it was just one 64b store vs. two 32b stores? Or with ORing the upper and lower halves just to force the compiler to combine them either way. – Peter Cordes Jul 07 '16 at 20:52
  • @PeterCordes: I'm curious why the sign-extended form is more useful than a non-extended form. Given an unsigned form, multiplying by -5 could be accomplished, regardless of register choice, by a MUL and a NEG. If one wanted to store AX*251 in BX, one could use an IMUL to compute AX*-5 and then add AL to BH, but that approach would only work with certain combinations of registers. – supercat Nov 17 '16 at 21:40
  • @supercat: If you wanted AX*251, you would just use `imul BX, AX, imm16` (one extra byte) instead of another whole instruction. Integers from -128..0 were probably more common than integers from 128..255. Structs as large as 128B were probably rare when 286 was being designed, right? (The `imul r16, r16` 2 operand form was added with 386, but the immediate forms were added with 286: http://www.posix.nl/linuxassembly/nasmdochtml/nasmdoca.html) – Peter Cordes Nov 18 '16 at 10:39
  • @supercat: Agreed that it's too bad Intel didn't do something more useful with the flags for IMUL. Although honestly, if it saved any transistors, I don't blame them. Reading flags after IMUL is really rare, and compilers still generally don't do that (because integer overflow detection is usually not done at all). It just means an extra instruction to detect overflow in some cases, or maybe having to use the one-operand form, I think. – Peter Cordes Nov 18 '16 at 10:59
  • @PeterCordes: I don't understand the rationale behind putting the same value in C and OVF, since having instructions leave carry untouched can sometimes be helpful. I don't know how often one would need to keep a carry across a multiply, but for some bitmap graphics applications it might possibly be handy [though in the most common scenario where one might use a multiply--a graphics memory layout where bytes are stacked perpendicular to bits--one would probably want to use an LEA to advance a stripe, rather than using an INC and a MUL]. – supercat Nov 18 '16 at 15:33
  • @supercat: CF is special: `sbb eax,eax` = 0 or -1, while other flags just have set/j/cmovcc. Intel's documentation for MULX says that the point of flagless-MUL is to allow interleaving with ADC for extended-precision. Extended precision stuff usually bottlenecks on latency, not uop throughput, so interleaving two calculations would help. I guess Intel didn't think of that when designing 8086, or for the new forms, since they weren't yet planning a pipelined / superscalar x86? And if they were, they might have thought partial-flag updates would give a false dependency without tricks. – Peter Cordes Nov 18 '16 at 19:48
  • @PeterCordes: I think three-operand IMUL came in with the 80186, so I would expect that pipeline complications from partial flag updates would have been a total non-consideration. – supercat Nov 18 '16 at 20:03
  • @supercat: See my earlier comment: it was 286 for 3-operand immediate, 386 for 2-operand. I was thinking 386 was late enough that other pipelined architectures were well known, but I agree it was probably not on Intel's radar at all. – Peter Cordes Nov 18 '16 at 20:09
  • @PeterCordes: The Intel 80386 architecture had its debut in 1985. I doubt there were any plans to make a super-scalar version when the architecture was introduced, but the 8086 compatibility was so important that the 80386 dominated other architectures that would have been more amenable to superscalar implementations. – supercat Nov 18 '16 at 20:48
  • @supercat: oh, I'd forgotten it was that early for 386. Perhaps the flag-setting behaviour was just overlooked :/ Or else they thought it would be confusing for different forms of IMUL to set flags differently. As always, backwards-compat reasons prevented them from changing it for one-operand IMUL :( – Peter Cordes Nov 18 '16 at 21:09
8

When multiplying two values, the least significant bits of the result are exactly the same, whether you do unsigned or signed multiplication. So, if you multiply two 32-bit values, you get a 64-bit result, the low 32-bits of which are the same, whether the multiplication is signed or unsigned. Same thing for a 64-bit multiplication, which produces a 128-bit result, the lower 64-bits of which are identical in both cases.
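
To see this with small numbers, here's a quick 8-bit sketch (my own example): 0xFF * 0xFF is 255 * 255 = 65025 = 0xFE01 as an unsigned multiply, but (-1) * (-1) = 1 = 0x0001 as a signed multiply. The upper halves differ (0xFE vs. 0x00), yet the low byte is 0x01 either way.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t  c = 255,  d = 255;          // bit pattern 0xFF
        int8_t   a = -1,   b = -1;           // same bit pattern, interpreted as signed
        uint16_t u = (uint16_t)c * d;        // unsigned view: 255 * 255 = 0xFE01
        int16_t  s = (int16_t)(a * b);       // signed view:  (-1) * (-1) = 0x0001
        printf("unsigned: 0x%04X  signed: 0x%04X\n",
               (unsigned)u, (unsigned)(uint16_t)s);
        return 0;
    }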

As such, compilers often use the IMUL instruction (whose mnemonic suggests signed multiplication) for both types of multiplication because it is more flexible than MUL, and generally faster. Whereas MUL comes in only one form (allowing an arbitrary general-purpose register or memory location to be multiplied by the implied destination register AL/AX/EAX/RAX), IMUL has many forms, including a one-operand form (same as MUL), a two-operand form (a destination register multiplied in place by a register, memory location, or immediate), and a three-operand form (a register or memory location multiplied by an immediate, with the result written to a separate destination register). More details are available in Intel's documentation (see the tag wiki for links), or in a quick reference for MUL and IMUL.

The reason the compiler can get away with using IMUL all the time is that you throw away the high-order bits of the result. When you do a 32-bit × 32-bit multiplication and store the result in a 32-bit variable, the upper 32 bits of the entire 64-bit result are discarded. Again, same for a 64-bit × 64-bit multiplication, which discards the upper 64 bits of the 128-bit result, leaving only the lower 64 bits, which are the same whether it is a signed or unsigned multiply.

Quoting from the Intel manual:

The two- and three-operand forms [of IMUL] may also be used with unsigned operands because the lower half of the product is the same regardless if the operands are signed or unsigned. The CF and OF flags, however, cannot be used to determine if the upper half of the result is non-zero.

Peter Cordes has also explained this very well in a section of his larger answer to a very general question on two's-complement arithmetic operations.

Anyway, when writing the assembly code yourself, you have to decide whether you want to do the same thing the compiler does and throw away the upper bits of the products, or whether you want to keep them. If you don't care about the upper bits and assume that the operation will not overflow, write the same code as the compiler does.

If you do care about the upper bits, just use the MUL instruction, which sets the CF and OF flags if the product of the multiplication is larger than can fit into the type of its operands.

mov  rax, QWORD PTR [a]   ; put 64-bit operand 'a' into RAX
mov  rbx, QWORD PTR [b]   ; put 64-bit operand 'b' into RBX
mul  rbx                  ; multiply 'a' * 'b'
; 128-bit result is returned in RDX:RAX (upper-bits:lower-bits)

jo  ProductOverflowed

Using MUL here is almost certainly more efficient than trying to find a way to use IMUL and testing the high 64-bits afterwards to see if they are non-zero (which would indicate an overflow). Simply having a non-predictable branch would put you way behind in performance, compared to the 1 or 2 μops you would stand to save with IMUL.

Cody Gray - on strike
  • You accidentally used the 2-operand variant there – harold Jul 07 '16 at 20:02
  • @harold The 2-operand variant of IMUL was intentional...I'm not multiplying by an immediate value. – Cody Gray - on strike Jul 07 '16 at 20:04
  • But then the result is not 128 bits – harold Jul 07 '16 at 20:05
  • xD, we both posted our answers within minutes of each other. But you don't need to zero `rdx` ahead of a multiply. It's always a write-only operand. So `mov rax, [a]` / `imul rax, [b]` would work, except for the bug harold pointed out. (Where you actually need the one-operand form `mul [b]`.) Or if you really want to test the upper half manually, use `mov rdx, [a]` / `mulx rsi, rdi, [b]` / `test rsi,rsi`. ([MULX is in BMI2](http://www.felixcloutier.com/x86/MULX.html)) – Peter Cordes Jul 07 '16 at 20:11
  • Hmm, I was thinking the two-operand version behaved the same way as the one-operand version, putting the result in RDX:RAX. I guess I read the manual too quickly. Well, that makes the answer less useful. Oh well. – Cody Gray - on strike Jul 07 '16 at 20:19