3

Why would you want to use:

MOV EAX, 22 
SHL EAX, 2

...when multiplying by 4, as opposed to just using the MUL instruction?
I understand that this can also be done with SHR instead of DIV.

What are the advantages of doing this?
Also, can you do this with odd numbers, or does it only work with even ones?

LearningProcess
    Think in base 10: shifting left/right to multiply by powers of 10 is far faster than doing the real multiplication (and no one does it that way anyway). The same applies to multiplying by a power of the base in any base. – phuclv Dec 04 '16 at 16:33
    To learn more about what's fast in asm, see the [x86 tag wiki](http://stackoverflow.com/tags/x86/info), especially [Agner Fog's guides](http://agner.org/optimize). See also [this answer I wrote](http://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466) about exactly how fast shift and LEA are compared to DIV. Modern Intel CPUs have extremely high performance multiply hardware (e.g. 3cycle latency, one per 1c throughput `imul r64, r64`), but immediate shifts are even faster (1c latency, two per clock tput). – Peter Cordes Dec 05 '16 at 15:30
    Why does "Pentium Pro" play a significant part in this question? a) not mentioned in the question body, b) They're long obsolete, c) the answer is relatively stable and useful with modern architectures. Delete from the question title? – Ira Baxter Dec 05 '16 at 16:16

2 Answers

5

There are a number of code idioms that are faster than "MUL constant".

Modern x86 CPUs execute a MUL in several clocks at minimum, so any code sequence that computes the product in 1-2 clocks will outperform MUL. You can use fast instructions (ADD, SHL, LEA, NEG), together with the fact that the processor can execute several of these instructions in parallel in a single clock, to replace MUL. In principle this means you can execute four of these instructions, in many combinations, in 2 clocks if you avoid data dependencies.

The LEA instruction is particularly interesting because it can multiply by some small constants (1,2,3,4,5,8,9) as well as move the product to another register, which is one easy way to break data dependencies. That allows you to compute a sub-product without destroying the original operand.
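Those particular constants fall directly out of the addressing-mode arithmetic: LEA computes base + index*scale, and the hardware scale factor is limited to 1, 2, 4, or 8. A tiny Python sketch (illustrative only, modeling the arithmetic rather than the instruction) enumerates the multipliers a single LEA can reach:

```python
# Multipliers reachable with one LEA: [scale*x] or [x + scale*x],
# where the hardware scale factor must be 1, 2, 4, or 8.
def single_lea_multipliers():
    muls = set()
    for scale in (1, 2, 4, 8):
        muls.add(scale)       # LEA r, [scale*x]
        muls.add(1 + scale)   # LEA r, [x + scale*x]
    return sorted(muls)

print(single_lea_multipliers())  # -> [1, 2, 3, 4, 5, 8, 9]
```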

Some examples:

Multiply EAX by 5, move product to ESI:

   LEA ESI, [EAX+4*EAX]    ; this takes 1 clock

Multiply EAX by 18:

   LEA  EAX, [EAX + 8*EAX]
   SHL  EAX, 1

Multiply EAX by 7, move result to EBX:

   LEA  EBX, [8*EAX]
   SUB  EBX, EAX

Multiply EAX by 28:

   LEA  EBX, [8*EAX]
   LEA  ECX, [EAX+4*EAX]  ; this and previous should be executed in parallel
   LEA  EAX, [EBX+4*ECX]

Multiply by 1020:

   LEA  ECX, [4*EAX]
   SHL  EAX, 10         ; this and previous instruction should be executed in parallel
   SUB  EAX, ECX

Multiply by 35

   LEA  ECX, [EAX+8*EAX]
   NEG  EAX             ; = -EAX
   LEA  EAX, [EAX+ECX*4]

So, when you want to achieve the effect of multiplying by a modest size constant, you have to think about how it can be "factored" into various products that the LEA instruction can produce, and how one might shift, add, or subtract a partial result to get the final answer.

It is remarkable how many multiply-by-constants can be produced this way. You might think this is only useful for really small constants, but as you can see from the 1020 example above, you can get some surprisingly medium-size ones, too. This turns out to be really handy when indexing into arrays-of-structs, because you have to multiply an index by the size of the struct. Often when indexing an array like this you want to compute the element address and fetch the value; in that case you can merge a final LEA into the MOV's addressing mode, which you cannot do with a real MUL. This buys you additional clock cycle(s) for doing the multiply by this type of idiom.
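The arithmetic behind each example sequence above is easy to sanity-check. Here is a quick Python sketch (illustrative, not x86 code) that mirrors each sequence, with results masked to 32 bits the way EAX wraps:

```python
# Each assertion mirrors one example sequence above, with the
# arithmetic masked to 32 bits the way EAX wraps around.
MASK = 0xFFFFFFFF

def check(x):
    assert (x + 4 * x) & MASK == (5 * x) & MASK             # LEA [EAX+4*EAX]
    assert ((x + 8 * x) << 1) & MASK == (18 * x) & MASK     # LEA, then SHL 1
    assert (8 * x - x) & MASK == (7 * x) & MASK             # LEA [8*EAX]; SUB
    assert (8 * x + 4 * (x + 4 * x)) & MASK == (28 * x) & MASK
    assert ((x << 10) - 4 * x) & MASK == (1020 * x) & MASK
    assert (-x + 4 * (x + 8 * x)) & MASK == (35 * x) & MASK

for value in (0, 1, 7, 0x12345678, 0xFFFFFFFF):
    check(value)
print("all sequences verified")
```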

[I have built a compiler that computes the "best multiply by constant" using these instructions by doing a small exhaustive search of instruction combinations; it then caches that answer for later reuse].
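That kind of search is easy to prototype. The following Python sketch (a hypothetical illustration of the idea, not the compiler mentioned above) does a breadth-first search over which multiples of x are currently held in registers, where each step models one SHL, ADD, SUB, or LEA:

```python
from collections import deque

def shortest_steps(target, max_shift=6, limit=4096):
    """Breadth-first search for the fewest SHL/ADD/SUB/LEA-style steps
    that build target*x out of x.  A state is the set of multiples of x
    currently held in registers; each step derives one new multiple."""
    if target == 1:
        return []
    start = frozenset({1})
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        held, steps = queue.popleft()
        candidates = []
        for a in held:
            for sh in range(1, max_shift + 1):                  # SHL
                candidates.append((a << sh, f"{a}x << {sh}"))
            for b in held:
                candidates.append((a + b, f"{a}x + {b}x"))      # ADD
                if a > b:
                    candidates.append((a - b, f"{a}x - {b}x"))  # SUB / NEG
                for s in (2, 4, 8):                             # LEA
                    candidates.append((a + s * b, f"{a}x + {s}*{b}x"))
        for val, desc in candidates:
            if val == target:
                return steps + [desc]
            # cap value range and register pressure to keep the search small
            if 0 < val <= limit and len(held) < 4:
                nxt = frozenset(held | {val})
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [desc]))
    return None

print(shortest_steps(35))  # three steps, e.g. 9x via LEA, 36x via SHL, then SUB x
```

Because the search is breadth-first, the first sequence found is a shortest one; a real compiler would also weight the steps by latency and available execution ports.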

Ira Baxter
    `imul r, r/m, imm32` is pretty nice as a mov-and-multiply. On modern Intel CPUs, it only has 3 cycle latency (even for 64-bit operand-size), and one per clock throughput. A lot of multiply constants can be done in 2 cycles, though, as you nicely demonstrate with examples of instruction-level parallelism. gcc and clang do the same thing. (clang-3.6 and older typically prefers an IMUL if it can't do the job with just one LEA, but modern clang favours latency over instruction / uop count the way gcc does.) – Peter Cordes Dec 05 '16 at 15:38
3

Using the SHL/SHR instructions is, generally speaking, faster than MUL/DIV; the difference is dramatic for DIV, while modern CPUs have fast multiply hardware that narrows the gap for MUL.

To answer your second question: yes, you can do this with odd numbers as well, but you have to add another instruction, so a single SHL/SHR is no longer enough.

For example, the following code multiplies the value in num by 5 without using the MUL instruction:

mov eax, num  ; load the value to be multiplied
mov ebx, eax  ; keep a copy of the original value
shl eax, 2    ; EAX = num * 4
add eax, ebx  ; EAX = num * 4 + num = num * 5
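The same shift-and-add trick works for any constant that is one away from a power of two: 2**k + 1 takes a shift plus an ADD, and 2**k - 1 a shift plus a SUB. A quick Python check of the identities (illustrative only, not x86 code):

```python
# Verify the SHL/ADD sequence above and its generalization to
# constants of the form 2**k + 1 (shift + ADD) and 2**k - 1 (shift + SUB).
for x in (0, 1, 42, 123456789):
    assert (x << 2) + x == 5 * x          # the sequence in the answer
    for k in range(1, 8):
        assert (x << k) + x == ((1 << k) + 1) * x   # 3, 5, 9, 17, ...
        assert (x << k) - x == ((1 << k) - 1) * x   # 1, 3, 7, 15, ...
print("ok")
```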
Cody Gray - on strike
CompSciFly
  • Brain fart.. that totally makes sense. I was trying to overcomplicate the `SHL` using floats. – LearningProcess Dec 03 '16 at 21:08
    How many cycles a shift is depends on the cpu model, but it hasn't been 1 clock per bit for a long time (if it ever was). He didn't ask about multiplying by 5 either, and you used `ADD` there :P – Jester Dec 03 '16 at 21:31
    Jester can you make an answer? I'd like to see what you have to say. No reason to be rude to the guy in comment either. – LearningProcess Dec 03 '16 at 21:39
  • Multiply by 5: mov eax,num / lea eax,[eax + eax*4] . I had the impression that shl / shr on some processors takes the same time regardless of shift count. – rcgldr Dec 04 '16 at 00:37
    Only on the 80186 did shifting cost 1 cycle per bit shifted. On the 8086 it cost 4 cycles per bit and it didn't support immediate operands for the shift count like you're using here. The 80286 and later CPUs all have barrel shifters which can perform shifts of any size in a single cycle. Modern out-of-order CPUs can potentially do two shifts at the same time effectively reducing the cost of shifting by any amount to just half a cycle. – Ross Ridge Dec 04 '16 at 02:29
  • @RossRidge: with the notable exception of the Pentium4 (which apparently has 4-5 cycles latency to do a shift). – ninjalj Dec 04 '16 at 15:53
  • Because the Pentium 4 lacked barrel-shifter hardware. Pretty much any time you see general statements about how 386 and later processors work, assume that they are accurate for everything *but* Pentium 4, which is an extreme oddball in many different respects. @ninja – Cody Gray - on strike Dec 04 '16 at 16:14
  • To multiply by 3, 5 or 9 [`LEA` will be better](http://stackoverflow.com/q/1658294/995714) – phuclv Dec 04 '16 at 16:26
    @ninjalj No, the Pentium 4 isn't an exception. The latency was constant regardless of the number of bits shifted, as the Pentium 4 also had a barrel shifter. There was additional overhead on lots of old CPUs, but the shift itself only takes one cycle. Since the '286, the cost of the shift instructions doesn't depend on the number of bits shifted. – Ross Ridge Dec 04 '16 at 18:17
    @ross Do you have a source for the claim that the Pentium 4 had a barrel shifter? Everything I've read strongly suggests that this hardware was omitted from P4, though it was present on all P6-gen processors, including its predecessor PIII and its successor Core. And performance tests *definitely* confirm that shifts are *significantly* slower on P4 than on other processors. – Cody Gray - on strike Dec 05 '16 at 13:06
  • @Cody Gray From what I see, the Pentium 4 does not have a barrel shifter. http://www.emulators.com/docs/pentium_1.htm – CompSciFly Dec 05 '16 at 13:22
    Yeah, I've read all of Darek's articles. They are not particularly authoritative, and he's well known for going on loosely substantiated rants, but he is nevertheless a smart guy, and the relevant articles on the x86 architecture are worth a read if you have the time and are interested. But this particular issue is one that I've seen verified across multiple sources, so I'm curious why Ross Ridge is claiming the opposite. Maybe he knows something I don't. – Cody Gray - on strike Dec 05 '16 at 13:24
  • @CodyGray: According to Agner Fog, P4 `SHL r,i` is 4+1c latency, one per 1c throughput. `shl r,cl` is 6+0c latency, one per 1c throughput. Vector shifts like PSLLQ perform the same as PAND: 2+1c latency, one per 1 or 2c throughput (MMX vs. XMM). GP-integer SHL is probably slow only because of the partial-flags and potential dependency on input flags (when count=0). Agner doesn't mention any count-dependent latency. (And that would be insane: fully-pipelined but variable-latency means unpredictable write-back conflicts. Barrel shifters aren't expensive compared to that!) – Peter Cordes Dec 05 '16 at 15:21
    @RossRidge: I think I might have a guess at this discrepancy between sanity and what [Wikipedia says](https://en.wikipedia.org/wiki/NetBurst_(microarchitecture)#Rapid_Execution_Engine): P4 has double-pumped integer ALUs. What Wikipedia is saying about "replacing the barrel shifter" might just mean that the shift/rotate ALU is not double-pumped, so it has much higher latency than ADD and so on, which Agner lists at 0.5c latency. – Peter Cordes Dec 05 '16 at 15:24
    @CodyGray: Oh hmm, I guess P4 really did have something worse than a barrel shifter. Prescott improved SHL and SHR to 1c latency, one per 0.5c throughput. So maybe that 4c time for a shift was long enough for the worst-case shift count. I still haven't seen anything definitive, just people repeating the theory that this is due to Intel not including a barrel shifter. But there are other possible explanations for worse shift/rotate performance, so I won't be convinced until I see something from a credible CPU analyst or an Intel paper. – Peter Cordes Dec 05 '16 at 15:43
  • Hey, feel free to improve the answer so that it's as accurate as possible. The guy below me should've gotten the best answer. – CompSciFly Dec 05 '16 at 16:20
    @CodyGray I don't have a source; it just wouldn't make sense. Maybe it's a slow barrel shifter where propagation delays push the execution time over one cycle, but the more likely explanation to me is that the delay comes from somewhere else. I suspect it's related to the shift instructions on the Pentium 4 (pre-Prescott) being listed as executed on the "mmxsh" subunit in Agner Fog's tables. That suggests to me that the additional cycles are the result of using the MMX barrel shifters for integer shifts, similar to the additional cycles MUL/DIV have because they're performed using the FP units. – Ross Ridge Dec 05 '16 at 18:13