
When looking at the assembly produced by Visual Studio (2015U2) in /O2 (release) mode I saw that this 'hand-optimized' piece of C code is translated back into a multiplication:

int64_t calc(int64_t a) {
  return (a << 6) + (a << 16) - a;
}

Assembly:

  imul        rdx,qword ptr [a],1003Fh  

So I was wondering if that is really faster than doing it the way it is written, something like:

  mov         rbx,qword ptr [a]  
  mov         rax,rbx  
  shl         rax,6  
  mov         rcx,rbx  
  shl         rcx,10h  
  add         rax,rcx  
  sub         rax,rbx  

I was always under the impression that multiplication is slower than a few shifts/adds. Is that no longer the case on modern Intel x86_64 processors?

Peter Cordes
rustyx

  - `IMUL` is typically on the order of 3-4 cycles latency, throughput 1 per clock. So yeah, probably faster. Also note that at least your last 3 instructions depend on each other, so that by itself would produce an equivalent dependency chain. – Jester Jun 20 '16 at 14:35
  - It would have been good in the old days: the 8086 `IMUL` took around 100 clocks, depending on addressing mode, register size etc. – Weather Vane Jun 20 '16 at 14:39
  - If you assume the load from memory is free (since you'd have to wait for it *anyway*) and the register-register `mov`s can be eliminated, and the shifts can execute in parallel, then ideally the shifting code would be about as fast as the multiply. However, it gunks up a lot of functional units and it's much more i-cache. – EOF Jun 20 '16 at 14:41
  - @EOF: latency yes, throughput no (because the shift version bottlenecks on total uop throughput, if not on execution ports). Interesting point that with zero-latency `mov` and 2 execution ports that can run `shl imm`, the shift version is still only 3c latency. – Peter Cordes Jun 20 '16 at 15:05

1 Answer


That's right, modern x86 CPUs (especially Intel) have very high performance multipliers.
`imul r, r/m` and `imul r, r/m, imm` both have 3-cycle latency and one per 1c throughput on Intel SnB-family and AMD Ryzen, even for 64-bit operand size.

On AMD Bulldozer-family, it's 4c or 6c latency, and one per 2c or one per 4c throughput (the slower numbers are for 64-bit operand size).

Data from Agner Fog's instruction tables. See also other performance links in the x86 tag wiki.


The transistor budget in modern CPUs is pretty huge, and allows for the amount of hardware parallelism needed to do a 64-bit multiply with such low latency. (It takes a lot of adders to make a large fast multiplier; see How modern X86 processors actually compute multiplications?)

Being limited by power budget, not transistor budget, means that having dedicated hardware for many different functions is possible, as long as they can't all be switching at the same time (https://en.wikipedia.org/wiki/Dark_silicon). e.g. you can't saturate the pext/pdep unit, the integer multiplier, and the vector FMA units all at once on an Intel CPU, because many of them are on the same execution ports.

Fun fact: imul r64 is also 3c, so you can get a full 64*64 => 128b multiply result in 3 cycles. imul r32 is 4c latency and an extra uop, though. My guess is that the extra uop / cycle is splitting the 64-bit result from the regular 64-bit multiplier into two 32-bit halves.


Compilers typically optimize for latency, and typically don't know how to trade off short independent dependency chains (where throughput matters) against long loop-carried dependency chains that bottleneck on latency.

gcc, and clang 3.8 and later, use up to two LEA instructions instead of `imul r, r/m, imm`. I think gcc will use `imul` if the alternative would be 3 or more instructions (not including `mov`), though.

That's a reasonable tuning choice, since a 3 instruction dep chain would be the same length as an imul on Intel. Using two 1-cycle instructions spends an extra uop to shorten the latency by 1 cycle.

clang 3.7 and earlier tended to favour imul except for multipliers that need only a single LEA or shift. So clang fairly recently changed to optimizing for latency instead of throughput for multiplies by small constants. (Or maybe for other reasons, like not competing with other things that run only on the same port as the multiplier.)

e.g. this code on the Godbolt compiler explorer:

int foo (int a) { return a * 63; }

    # gcc 6.1 -O3 -march=haswell (and clang actually does the same here)
    mov     eax, edi  # tmp91, a
    sal     eax, 6    # tmp91,
    sub     eax, edi  # tmp92, a
    ret

clang 3.8 and later make the same code.

Peter Cordes

  - Does it mean then that an `imul`, while being faster, potentially consumes more energy than 4 ALU ops (since a multiplier activates a lot more transistors)? – rustyx Jun 21 '16 at 07:17
  - @rustyx: IDK. Tracking 4 ALU uops through the pipeline and forwarding their results to each other takes more energy than tracking one, and that may well cost more than operating the multiplier! Out-of-order execution is expensive. I wish I had some idea what the numbers were, so I could guess at the answer, but I haven't done any power optimization (other than obvious stuff like using 128-bit AVX vector ops when I don't need the upper 128 of the result). – Peter Cordes Jun 21 '16 at 07:33
  - @rustyx: BTW, a high-latency multiplier using fewer transistors might still have a similar energy cost per multiply, since it still has to add all the partial sums. It just takes more cycles. (This is a massive over-simplification, and might actually be wrong. I'm not a gate-logic guy, and didn't really re-read those links in detail.) A low-latency multiplier makes it tempting to use it more often, though (e.g. instead of two `lea` instructions to multiply by 10). – Peter Cordes Jun 22 '16 at 12:32
  - Do you have a source for CPUs using Wallace trees for their multipliers? – Randomblue Aug 22 '18 at 22:08
  - @Randomblue: no, I don't usually get into that level of detail of HW design. I don't know *why* Bulldozer has 4c latency / 2c throughput multiply vs. Ryzen and Intel's 3c latency / 1c throughput `imul r32,r32`, just that those are the performance numbers. IDK which microarchitectures use which techniques. Wikipedia mentions some other options, and has some links. – Peter Cordes Aug 22 '18 at 22:12
  - @Randomblue: [How modern X86 processors actually compute multiplications?](https://stackoverflow.com/a/62117473) says all modern CPU ALUs use Dadda trees. – Peter Cordes May 31 '20 at 23:46