1

I developed a program in x86-64 assembly which needs to iterate many times through the same operation:

IMUL rdx, 3   # rdx is always different

However, I need to make the runtime faster, so I thought of an optimization for that specific line from above:

MOV rcx, rdx
SHL rdx, 1
ADD rdx, rcx

Now I ask you guys: would this modification improve the runtime of the program (less clocks), or should I stick with the IMUL command?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    It doesn't work, `add rdx, rdx` just doubles `rdx` again instead of adding the original `rdx`, and `shl rdx, 2` multiplies by 4 already, but how about `lea rdx, [rdx + rdx * 2]`? – harold Jul 19 '19 at 10:26
  • I have seen this solution many times using `lea`, but I didn't manage to fully understand it. Would it be possible to clarify it? Also, would it be and improvement in performance? – Cosmin Aprodu Jul 19 '19 at 11:15
  • 2
    `lea` [evaluates the address and then stores that in the destination](https://stackoverflow.com/q/1658294/555045), it's not a memory access despite looking like one. – harold Jul 19 '19 at 11:44
  • Thanks! I also updated the question, but I guess now that is even less optimal. I will take a look on the explanation. – Cosmin Aprodu Jul 19 '19 at 11:46
  • Do you mean `IMUL rdx, rdx, 3`? There is no `IMUL` instruction with only one register and an immediate. – Andreas Abel Jul 19 '19 at 13:09
  • Indeed, there is. Take a look into the developer's manual: https://www.google.com/url?sa=t&source=web&rct=j&url=https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf&ved=2ahUKEwiXoqu-icHjAhUB0qYKHf3jAV8QFjACegQIBhAB&usg=AOvVaw0bkRab9lEOtDV0i4iY5kyf page 3-443 – Cosmin Aprodu Jul 19 '19 at 13:21
  • 2
    @AndreasAbel : You are correct there is no 2 operand form that takes an immediate but what many assemblers do is generate the 3 operand form from the 2 operands. So `imul rdx, 3` is encoded as `imul rdx,rdx, 3` (the instruction would be encoded as `48 6b d2 03`) – Michael Petch Jul 19 '19 at 13:27
  • Are you trying to improve latency or throughput? Which microarchitecture are you targeting? – Andreas Abel Jul 19 '19 at 13:29
  • The latency is what I am trying to improve. I am working on Intel Architecture, 64bit – Cosmin Aprodu Jul 19 '19 at 17:19

1 Answers1

10

Both are terrible compared to lea rdx, [rdx + rdx*2], using a scaled-index addressing mode to get a total of *3, which is why compilers will always use LEA if you ask them to compile a function like

long foo(long x){ return x * 3; } (https://godbolt.org/z/6p4ynV)


LEA is a way to feed arbitrary numbers through x86 addressing modes without using the result for a load or store, just putting it in a register. Using LEA on values that aren't addresses / pointers?


On all modern x86 CPUs, LEA is a single uop. The only question is how much better it is than the alternatives. imul is also 1 uop, but mov+shl+add is 3 for the front-end. (This is true across all mainstream and low-power Intel/AMD that are still relevant. See https://agner.org/optimize/) 64-bit imul is extra slow on some older microarchitectures, like Bulldozer-family and Silvermont/Goldmont, or especially older Atom.

On AMD CPUs (Bulldozer/Ryzen), it has a scaled index so it's a "complex" LEA and has 2 cycle latency (vs. 3 for imul on Ryzen, or much worse on Bulldozer-family where 64-bit imul is slower and not fully pipelined). On Ryzen this LEA still has 2-per-clock throughput.

On Intel CPUs, it only has 2 components (one +), so it's a "simple" LEA with 1 cycle latency and can run with 2-per-clock throughput. So about the same cost as one shl instruction, but runs on different ports.

(Or on Ice Lake, 4-per-clock since they added LEA units to the other 2 integer ALU ports. So it's exactly as cheap as one add on Ice Lake.)


You'd only want mov ; shl ; sub or add when your multiplier was 2^n +- 1 for n > 3. Then it is worth considering imul for a tradeoff between latency and front-end throughput cost.

By shifting the original register, even CPUs without mov-elimination (before IvyBridge and Ryzen) can run your mov/shl/add sequence with 2 cycle latency critical path length.


Also related: C++ code for testing the Collatz conjecture faster than hand-written assembly - why? has some details about a problem with *3 vs. optimizing with LEA.

Other related:

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Thank you for the detailed answer! One more small question: In this project I also use the `IDIV` command, but for 2 reasons: to extract the number rounded to the smaller integer and to extract modulo with `rdx`. Are there any possible optimizations, or the command should be used as it is? – Cosmin Aprodu Jul 19 '19 at 17:29
  • @CosminAprodu: Again, ask a compiler. Yes, multiplicative inverses are *much* faster than `div`. [Why does GCC use multiplication by a strange number in implementing integer division?](//stackoverflow.com/q/41183935). Given the quotient, you have to multiply and subtract to get the remainder. Look at compiler output to see how it's done (especially for signed integers where sign-handling is tricky). – Peter Cordes Jul 19 '19 at 17:31