3

How does this work exactly? I know that lea is efficient compared to using add/mov instructions because it doesn't go through the ALU or set any flags. So how is lea getting its addresses? What makes it better than add/mov?

Michael Petch
Instinct
  • Why is x86-lea a tag? – harold Apr 16 '18 at 03:50
  • @harold: Michael Petch and I [think it shouldn't be](https://stackoverflow.com/questions/1658294/whats-the-purpose-of-the-lea-instruction/12805360?noredirect=1#comment86713234_12805360) (because of lack of scope for future non-duplicate questions) but I'm not going to undo all the tagging right away. – Peter Cordes Apr 16 '18 at 06:33

2 Answers

5

The idea that lea doesn't go through the ALU is outdated and has been wrong for over a decade. It executes on one of the ALUs - modern CPUs have several, so that is extremely unlikely to be a bottleneck - but it does take time. It's not faster than an add.

But that doesn't mean it isn't useful. Unlike add, it can have a destination that differs from both inputs, so it can save a mov, and it can take an extra constant, so you can do two adds and a mov all in one instruction. The scaling is also nice, and combined you can do something like a = b * 9 + 5 in a single lea a, [b + b*8 + 5]. Such "complex" forms of lea are often slower than simpler 2-operand forms of lea, though.
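For instance (a rough NASM-style sketch, assuming b lives in rbx and the result goes in rax; the register choice is just for illustration):

```nasm
; a = b * 9 + 5

; with lea: one instruction, rbx is left untouched
lea rax, [rbx + rbx*8 + 5]

; without lea: a mov plus shift/adds, and the flags get clobbered
mov rax, rbx        ; copy b
shl rax, 3          ; rax = b * 8
add rax, rbx        ; rax = b * 9
add rax, 5          ; rax = b * 9 + 5
```

The lea version also leaves the source register and the flags untouched, which is part of why compilers like it for this kind of arithmetic.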

harold
3

The LEA instruction uses the effective address (EA) part of the CPU to do certain kinds of arithmetic without using the ALU. The EA unit can only do the very specific kinds of arithmetic that are used for address calculation, but it can also be used for other things when what you need happens to be one of the forms it provides.
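Concretely, an x86 addressing mode can only express base + index*scale + displacement, with the scale limited to 1, 2, 4, or 8, so lea is limited to arithmetic of that shape. A rough NASM-style sketch, with arbitrary registers chosen just for illustration:

```nasm
lea rax, [rsi + 16]          ; rax = rsi + 16        (copy plus add constant)
lea rax, [rdi + rsi]         ; rax = rdi + rsi       (3-operand add)
lea rax, [rsi*4]             ; rax = rsi * 4         (scale by 1, 2, 4, or 8 only)
lea rax, [rdi + rsi*2 + 8]   ; rax = rdi + rsi*2 + 8
```

None of these write the flags, unlike add or shl.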

Because the ALU is not used for this EA calculation, it can be busy doing something else at the same time, and you may avoid a pipeline stall.

Greg Hewgill
  • I saw another post similar to this on stackoverflow by another person. "The fact that LEA goes through the address generation logic instead of the arithmetic units is also the reason why it used to be called "zero-clocks"; it takes no time to execute because address generation has already happened by the time it would be / is executed." Do you know what he means by "it takes no time to execute because address generation has already happened by the time it is executed"? How is an address generated by execution time? – Instinct Dec 09 '12 at 01:13
  • The CPU runs instructions in a "pipeline", instead of doing them one by one and completely finishing one instruction before starting the next. The pipeline means the CPU can be preparing for future instructions at the *same time* as it is finishing the execution of the current instruction. See https://en.wikipedia.org/wiki/Instruction_pipeline if you want to read more about this. – Greg Hewgill Dec 09 '12 at 01:18
  • This is maybe true on in-order Atom (pre-Silvermont), but not on anything else recent. Atom does run LEA earlier in the pipeline than normal ALU instructions, with implications for latency, but it needs its inputs ready earlier. All modern x86 CPUs have out-of-order execution, including even the current generation of Xeon Phi, so this doesn't apply. Also, in-order Atom has two ALU ports and 2-per-clock throughput for `add`, so the only effect of using LEA is a latency difference, not a throughput difference. http://agner.org/optimize/ – Peter Cordes Apr 16 '18 at 06:37