18

Will modern (2008/2010) incarnations of Visual Studio or Visual C++ Express produce x86 MUL instructions (unsigned multiply) in the compiled code? I cannot seem to find or contrive an example where they appear in compiled code, even when using unsigned types.

If VS does not compile using MUL, is there a rationale why?

Spain Train
  • 5,890
  • 2
  • 23
  • 29
  • What instruction(s) would it use otherwise? – Jeff Mercado Oct 28 '10 at 02:53
  • 2
    @Jeff M I think perhaps the poster meant that IMUL is used instead in the compiled code. –  Oct 28 '10 at 02:54
  • @pst: I was just asking because I didn't have access to the compiler and couldn't see what instructions were actually used. I caved in and booted up my dev machine to figure it out. :) – Jeff Mercado Oct 28 '10 at 03:04
  • @Jeff M I'm curious (but not that curious ;-) and was trying to prompt the poster to add clarification :p –  Oct 28 '10 at 03:09
  • (edit wasn't working) To further clarify my question, I was essentially wondering whether Intel had released some kind of optimization suggestions surrounding MUL vs IMUL. Or whether MS had furnished a rationale for the instructions they use (less likely.) – Spain Train Oct 28 '10 at 14:50
  • Possible duplicate of [Why is imul used for multiplying unsigned numbers?](http://stackoverflow.com/questions/42587607/why-is-imul-used-for-multiplying-unsigned-numbers) – phuclv Mar 25 '17 at 01:53

6 Answers

31

imul (signed) and mul (unsigned) both have a one-operand form that does edx:eax = eax * src. i.e. a 32x32b => 64b full multiply (or 64x64b => 128b).

The 186 added an imul dest(reg), src(reg/mem), immediate form, and the 386 added an imul r32, r/m32 form, both of which only compute the lower half of the result. (According to NASM's appendix B; see also the x86 tag wiki.)

When multiplying two 32-bit values, the least significant 32 bits of the result are the same, whether you consider the values to be signed or unsigned. In other words, the difference between a signed and an unsigned multiply becomes apparent only if you look at the "upper" half of the result, which one-operand imul/mul puts in edx and two or three operand imul puts nowhere. Thus, the multi-operand forms of imul can be used on signed and unsigned values, and there was no need for Intel to add new forms of mul as well. (They could have made multi-operand mul a synonym for imul, but that would make disassembly output not match the source.)
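The low-half equivalence is easy to demonstrate in C. A minimal sketch (the function names here are invented for illustration):

```c
#include <stdint.h>

/* Low 32 bits of a product do not depend on signedness. */
uint32_t umul_lo(uint32_t a, uint32_t b) {
    return a * b;                              /* unsigned multiply */
}

uint32_t smul_lo(uint32_t a, uint32_t b) {
    return (uint32_t)((int32_t)a * (int32_t)b); /* signed multiply */
}
```

For example, with a = 0xFFFFFFFF (i.e. -1 as signed) and b = 7, both functions return 0xFFFFFFF9: -7 modulo 2^32 in the unsigned view, and the two's-complement encoding of -7 in the signed view.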

In C, results of arithmetic operations have the same type as the operands (after integer promotion for narrow integer types). If you multiply two int together, you get an int, not a long long: the "upper half" is not retained. Hence, the C compiler only needs what imul provides, and since imul is easier to use than mul, the C compiler uses imul to avoid needing mov instructions to get data into / out of eax.

As a second step, since C compilers use the multiple-operand form of imul a lot, Intel and AMD invest effort into making it as fast as possible. It only writes one output register, not e/rdx:e/rax, so it was possible for CPUs to optimize it more easily than the one-operand form. This makes imul even more attractive.

The one-operand form of mul/imul is useful when implementing big number arithmetic. In C, in 32-bit mode, you should get some mul invocations by multiplying unsigned long long values together. But, depending on the compiler and OS, those mul opcodes may be hidden in some dedicated function, so you will not necessarily see them. In 64-bit mode, long long has only 64 bits, not 128, and the compiler will simply use imul.
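As an illustration of that big-number use case, here is a sketch (names invented for this example) of multiplying an array of 32-bit limbs by a 32-bit scalar; in 32-bit mode, each `(uint64_t)limb * x` is exactly the 32x32 => 64 full multiply that one-operand mul provides:

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply a little-endian array of 32-bit limbs by a 32-bit scalar,
   in place, returning the final carry-out limb. */
uint32_t mul_limbs(uint32_t *limbs, size_t n, uint32_t x) {
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)limbs[i] * x + carry; /* full 64-bit product */
        limbs[i] = (uint32_t)p;                      /* low half (eax after mul)  */
        carry    = (uint32_t)(p >> 32);              /* high half (edx after mul) */
    }
    return carry;
}
```

Whether the compiler emits an actual mul here or synthesizes the widening multiply another way depends on the target and optimization settings.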

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Thomas Pornin
  • 72,986
  • 14
  • 147
  • 189
  • Are you certain of the causality of IMUL/MUL optimizations? Is it possible that VS prefers IMUL because it happens to already be faster (vice compilers prefering it, causing Intel/AMD to make it faster)? – Spain Train Oct 28 '10 at 14:47
  • 3
    @Mike: on the 80386, `mul` and `imul` offer the same speed, and C compilers were already using `imul` because of the convenience of choosing the registers. So I think that compilers chose first, and processor vendors followed, not the other way round. – Thomas Pornin Oct 28 '10 at 17:25
  • 1
    in 64-bit mode it'll use mul for [`__int128`](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html) – phuclv Aug 12 '14 at 17:54
  • In 64-bit mode does `mul` return the upper 64-bit word? I mean is the lower 64-bit word in `rax` and the upper in `rdx`? – Z boson Mar 02 '15 at 08:49
  • The immediate form of the `imul` instruction was added on the 186, not the 286. – Myria Jan 19 '17 at 19:22
  • @Myria: thanks, the current version of the NASM manual agrees with you and says 186. No idea why the older version had it wrong. – Peter Cordes Jul 27 '19 at 16:12
9

There are three different types of multiply instructions on x86. The first is MUL reg, which does an unsigned multiply of EAX by reg and puts the (64-bit) result into EDX:EAX. The second is IMUL reg, which does the same with a signed multiply. The third type is either IMUL reg1, reg2 (multiplies reg1 with reg2 and stores the 32-bit result into reg1) or IMUL reg1, reg2, imm (multiplies reg2 by imm and stores the 32-bit result into reg1).

Since in C, multiplies of two 32-bit values produce 32-bit results, compilers normally use the third type (signedness doesn't matter, the low 32 bits agree between signed and unsigned 32x32 multiplies). VC++ will generate the "long multiply" versions of MUL/IMUL if you actually use the full 64-bit results, e.g. here:

unsigned long long prod(unsigned int a, unsigned int b)
{
  return (unsigned long long) a * b;
}

The 2-operand (and 3-operand) versions of IMUL are faster than the one-operand versions simply because they don't produce a full 64-bit result. Wide multipliers are large and slow; it's much easier to build a smaller multiplier and synthesize long multiplies using microcode if necessary. Also, one-operand MUL/IMUL writes two registers, which again is usually resolved by breaking it into multiple instructions internally - it's much easier for the instruction reordering hardware to keep track of two dependent instructions that each write one register (most x86 instructions look like that internally) than it is to keep track of one instruction that writes two.

Fabian Giesen
  • 3,231
  • 16
  • 16
  • It's splitting the result that's the problem for modern Intel CPUs (SnB family), not actually doing the wide multiply. `imul r,r` (any size including 64bit) is 1 uop. `imul r/m32` is 3 uops, while `imul r/m64` is only 2 uops, maybe because it doesn't have to split the output of the 64bit multiplier hardware? `imul r/m8` is 1 uop, presumably because the result goes in AX, so it's still only a single register. BTW, CPU architects would say that most x86 integer instructions write two output registers, including the flags. – Peter Cordes Apr 03 '16 at 23:08
4

According to http://gmplib.org/~tege/x86-timing.pdf, the IMUL instruction has a lower latency and higher throughput (if I'm reading the table correctly). Perhaps VS is simply using the faster instruction (that's assuming that IMUL and MUL always produce the same output).

I don't have Visual Studio handy, so I tried to get something else with GCC. I also always get some variation of IMUL.

This:

unsigned int func(unsigned int a, unsigned int b)
{ 
    return a * b;
}

Assembles to this (with -O2):

_func:
LFB2:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        movl    %esi, %eax
        imull   %edi, %eax
        movzbl  %al, %eax
        leave
        ret
Seth
  • 45,033
  • 10
  • 85
  • 120
2

Right after I looked at this question I found MULQ in my generated code when dividing.

The full code turns a large binary number into chunks of a billion so that it can easily be converted to a string.

C++ code:

for_each(TempVec.rbegin(), TempVec.rend(), [&](Short & Num){
    Remainder <<= 32;
    Remainder += Num;
    Num = Remainder / 1000000000;
    Remainder %= 1000000000;//equivalent to Remainder %= DecimalConvert
});

Optimized Generated Assembly

00007FF7715B18E8  lea         r9,[rsi-4]  
00007FF7715B18EC  mov         r13,12E0BE826D694B2Fh  
00007FF7715B18F6  nop         word ptr [rax+rax] 
00007FF7715B1900  shl         r8,20h  
00007FF7715B1904  mov         eax,dword ptr [r9]  
00007FF7715B1907  add         r8,rax  
00007FF7715B190A  mov         rax,r13  
00007FF7715B190D  mul         rax,r8  
00007FF7715B1910  mov         rcx,r8  
00007FF7715B1913  sub         rcx,rdx  
00007FF7715B1916  shr         rcx,1  
00007FF7715B1919  add         rcx,rdx  
00007FF7715B191C  shr         rcx,1Dh  
00007FF7715B1920  imul        rax,rcx,3B9ACA00h  
00007FF7715B1927  sub         r8,rax  
00007FF7715B192A  mov         dword ptr [r9],ecx  
00007FF7715B192D  lea         r9,[r9-4]  
00007FF7715B1931  lea         rax,[r9+4]  
00007FF7715B1935  cmp         rax,r14  
00007FF7715B1938  jne         NumToString+0D0h (07FF7715B1900h)  

Notice the MUL instruction 5 lines down. This generated code is extremely unintuitive, I know; in fact it looks nothing like the original code, but DIV is extremely slow (~25 cycles for a 32-bit div, and ~75 according to this chart on modern PCs) compared with MUL or IMUL (around 3 or 4 cycles), so it makes sense to try to get rid of DIV even if you have to add all sorts of extra instructions.

I don't fully understand the optimization here, but if you would like to see a rationale and a mathematical explanation of using compile-time multiplication to divide by constants, see this paper.

This is an example of the compiler making use of the performance and capability of the full 64-by-64-bit untruncated multiply without showing the C++ coder any sign of it.
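For the curious, the mul/sub/shr/add/shr sequence in the disassembly can be reconstructed in C. This is a sketch, not the compiler's actual source: it assumes a compiler with the `unsigned __int128` extension (GCC/Clang) to get the high half of the 64x64 multiply, and uses the magic constant taken straight from the disassembly above:

```c
#include <stdint.h>

/* Reconstruction of the compiler's divide-by-1e9 sequence shown in the
   disassembly above. Requires unsigned __int128 (GCC/Clang extension). */
uint64_t div_1e9(uint64_t n) {
    const uint64_t M = 0x12E0BE826D694B2Fu;                     /* mov r13, magic  */
    uint64_t hi = (uint64_t)(((unsigned __int128)n * M) >> 64); /* mul: rdx half   */
    uint64_t t  = ((n - hi) >> 1) + hi;                         /* sub / shr / add */
    return t >> 29;                                             /* shr rcx, 1Dh    */
}
```

The `((n - hi) >> 1) + hi` dance computes `(n + hi) / 2` without overflowing 64 bits; the net effect is `floor(n * ceil(2^94 / 10^9) / 2^94)`, which equals `n / 1000000000` for all 64-bit n.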

Benjamin
  • 441
  • 4
  • 7
  • This is compiled using VS 2013, with default release settings. And I also found the same exact optimization on GCC -O2. – Benjamin Mar 03 '15 at 05:54
  • 2
    A tremendous amount of effort was put into "[strength reduction](https://en.wikipedia.org/wiki/Strength_reduction)" of division (and multiplication) in GCC back in the 90s, when a lot of CPUs didn't even *have* a hardware divide instruction. The author of the paper you found was responsible for much of that work. If you are curious, read through [expmed.c](https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/expmed.c), the bulk of which is implementing this optimization. I didn't know MSVC could do this too but it doesn't surprise me. – zwol Mar 05 '15 at 15:28
  • Blast from the past ^_^ That is an interesting example, and thanks for the link to the paper. @zwol thanks for additional context. – Spain Train Mar 27 '15 at 17:07
2

My intuition tells me that the compiler chose IMUL arbitrarily (or whichever was faster of the two) since the bits will be the same whether it uses unsigned MUL or signed IMUL. Any 32-bit integer multiplication produces a 64-bit result spanning two registers, EDX:EAX. The upper half goes into EDX and is essentially ignored, since we only care about the 32-bit result in EAX. Using IMUL will sign-extend into EDX as necessary, but again, we don't care since we're only interested in the 32-bit result.

Jeff Mercado
  • 129,526
  • 32
  • 251
  • 272
1

As already explained, C/C++ does not do word*word to double-word operations, which is what the mul instruction is best for. But there are cases where you want word*word to double-word, so you need an extension to C/C++.

GCC, Clang, and ICC provide a builtin type __int128 which you can use to indirectly get the mul instruction.

MSVC provides the _umul128 intrinsic (since at least VS 2010) which generates the mul instruction. With this intrinsic along with the _addcarry_u64 intrinsic you could build your own efficient __int128 type with MSVC.
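As a sketch of the GCC/Clang route (the function name `mulhi64` is invented here; on MSVC the equivalent would go through _umul128 instead):

```c
#include <stdint.h>

/* High 64 bits of a full 64x64 => 128 multiply. With GCC/Clang this
   typically compiles down to a single mul (or mulx) instruction. */
uint64_t mulhi64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```

The truncating low half is just ordinary `a * b` on uint64_t, so the pair gives you the full double-word product.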

Z boson
  • 32,619
  • 11
  • 123
  • 226