
I had a conversation with my professor and he said:

leaq (%rax,%rax,8), %rax

Is faster than:

imulq $9, %rax

I asked him why (in both cases we are multiplying by the same number) and he said we won't get into that.

Can someone help me understand, in a simple way, why leaq is fast in general?

A question that arose from the comments: is

imulq $9, %rax

faster than doing it with 2 instructions, one to shift left and another to add %rax (a copy of which we could previously save in a register)?

And why?

  • `leaq` performs a shift and an addition, not a multiplication. That's why it's possible to perform it in 1 cycle while the `imulq` takes 4-ish cycles. – fuz Apr 10 '21 at 12:27
  • @fuz but why? I have 2 questions: 1) What if I wrote `leaq (%rax,%rax,7)`? A shift won't help here, as it can't be performed. 2) Why does it do a shift? I learnt that we calculate the value like (in1*r1)+r2+in2 for the command `leaq in2(%r2,%r1,in1)` –  Apr 10 '21 at 12:34
  • @john `leaq (%rax, %rax, 7)` is not a legal instruction. The scale must be 1, 2, 4, or 8. It's really a shift, but it's written as a multiplication for the convenience of the programmer. – fuz Apr 10 '21 at 12:51
  • @fuz got you. A few last things: why do we write 8 and not $8, for example? I learnt that every time I need to write a constant number I should write $ before it. Plus, could you take a look at the updated question (last lines)? –  Apr 10 '21 at 13:08
  • `8` is a memory operand, accessing memory at address `8`. `$8` is an immediate operand with value `8`. – fuz Apr 10 '21 at 13:17
  • as you said, for `lea` we write 8, not $8. Why? 8 here means a memory address... –  Apr 10 '21 at 13:22
  • The dollar sign indicates the *addressing mode*. It is not a general prefix for numbers, just a prefix to clarify the addressing mode. The syntax neither requires nor recognises this prefix anywhere else. – fuz Apr 10 '21 at 13:28
  • can you answer my other question in the updated version? –  Apr 10 '21 at 16:02

2 Answers


The `lea` (load effective address) instruction is a way to perform the common operation of pointer arithmetic. How an instruction refers to its operands is called its addressing mode, and `lea` supports base plus scaled index plus offset addressing modes (among others).

address = base address + index * scaling + offset

where the scaling value can be one of a few powers of two (1, 2, 4, 8). These values are useful for indexing arrays of bytes, wide characters, integers, pointers, etc. It is not capable of encoding or performing multiplication by arbitrary values. In hardware, these few options can be implemented with a couple of multiplexers, adding only a fraction of a cycle of delay.

A multiplication instruction, on the other hand, goes through multiplication circuitry that can multiply two arbitrary full-width (64-bit) operands. This is an operation of significantly higher complexity. Even with multiple full-width adders working in parallel, it has about six times (log n) the delay of a full-width addition (although the design might incorporate an optimization that lets it multiply simpler values faster).

mkayaalp
  • Worth mentioning that the scale factor is encoded in machine code as a 2-bit shift-count. See also [Using LEA on values that aren't addresses / pointers?](https://stackoverflow.com/a/46597375) - LEA latency is 1 cycle only for simple-enough addressing modes. On Intel SnB-family CPUs, there aren't uops with 2-cycle latency, and `lea` with 3 components (two `+` operations) has 3 cycle latency. But yes, simpler LEAs like the one in the question (with no displacement) are 1 cycle latency and throughput of 2/clock. (Or 4/clock on Ice Lake) – Peter Cordes Apr 10 '21 at 16:55
  • (On AMD, `lea` is 1 or 2-cycle latency, with similar throughput reduction (fewer ports) for slow LEA. And the conditions for being fast are more restrictive: a scale factor other than `1` makes it slow. https://uops.info/ https://agner.org/optimize/) – Peter Cordes Apr 10 '21 at 18:25

LEA is just a shift-and-add instruction, with a 2-bit shift count as part of the machine encoding. That's much cheaper to build in hardware than a full 64-bit multiplier, which is why CPUs can easily have multiple execution units that can handle LEA uops. (For better than 1/clock throughput.)

Note that LEA latency is 1 cycle only for simple-enough addressing modes (before Ice Lake). On Intel SnB-family CPUs, there aren't any uops with 2-cycle latency, and LEA with 3 components (two + operations) has 3 cycle latency. Apparently Intel couldn't or didn't fit enough gate-delays for 2 additions (or a 3->2 reduction and one addition) into a single ALU cycle until Ice Lake.

But yes, simpler LEAs like the one in the question (with no displacement) are 1 cycle latency and throughput of 2/clock on SnB-family, with "slow" LEAs only running on port 1 (the only execution port on SnB-family that can run integer uops with latency other than 1.)

Ice Lake is always 1c latency, 1 uop: 2/clock throughput for addressing modes including a scaled index (shift-count != 0), or 4/clock otherwise. (Even for 3-component operations like `lea 1(%rax, %rcx), %edx` that would be a "slow LEA" on Skylake or Zen.)

On AMD, `lea` is 1- or 2-cycle latency, with a similar throughput reduction (fewer ports) for slow LEA. And the conditions for being fast are more restrictive: a scale factor other than `1` makes it slow. But Zen still has 2 execution units that can handle "slow" LEAs, and 4 for fast LEAs. https://uops.info/ https://agner.org/optimize/


is `imulq $9, %rax` faster than doing 2 commands, one to shift left and another to add %rax (which we could previously save in a register)?

`imul $9, %rax` is 1 uop, 3c latency, 1/clock throughput on AMD since Zen and on Intel since Nehalem (https://uops.info/). Higher latency on older CPUs, especially for 64-bit operand-size.

`shl $3, %rax` / `add %rcx, %rax` is 2 uops for the front-end, but only has 2 cycle latency. (And probably an extra `mov` somewhere before that, for a 3rd uop.)

However, any decent compiler would use `lea (%rax,%rax,8), %rax` instead (a*9 = a + a*8): 1 uop, 1c latency on Intel, 2/clock throughput, so it's not worse in any way and better in many ways. (Or at worst, 2-cycle latency on AMD because of the scaled index, but that's still better than `imul`.)

When you're looking at a single instruction or short sequence, performance isn't one dimensional, but rather 3: front-end uops, back-end ports, and latency for the critical path. There's no single-number cost you can add up across instructions to find out how long a block of instructions will take; the whole point of superscalar out-of-order execution is to find instruction-level parallelism, whose existence depends on how instructions use each other's results. (But sometimes you can say that one sequence is at least as good as another in every way, if it's the same or better in all 3 ways across all existing CPUs.)

Peter Cordes
  • For icelake and newer 32/64 bit `lea` has 1c latency / ALU ports & throughput for all modes. – Noah Apr 11 '21 at 02:25
  • @Noah: Thanks, fixed, I was mixing something up in my memory. Also, yay; I wonder if 10nm really brought gate delays down enough to make this possible, or if some new circuit design reduced latency. Two additions and a shift don't seem like *that* much work, but maybe some of that 1c is getting results in and out, unlike piping between stages of `imul`. (I'm ignoring 16-bit operand-size in this answer on purpose; it's usually pointless and on Intel slower for the expected reason of being an extra input to do the merging requiring an extra uop.) – Peter Cordes Apr 11 '21 at 02:34
  • BTW the circuit for a 3 operand addition could look like a normal 2 operand adder, but with a 3-to-2 reduction layer bolted to the front of it, which is just a layer of full adders. So nowhere near double the cost of a 2-operand adder. `psadbw` could be doing that too, but with more reductions. Of course I have no idea what circuits Intel is *actually* using.. – harold Apr 11 '21 at 02:45
  • @harold: `psadbw` is 5 cycle latency (or 3 on SKL and later), brain fart on my part which I fixed while my previous comment was still editable. It's notable in doing all that work with a single *uop*, not a single *cycle* :P But sure, yeah I'd guess they optimize it some. Anyway yeah, good point that 3-input addition can be optimized, but OTOH that would require the shifted operand to be ready right away(?). A full barrel shift fits into 1 cycle on its own, but a 2-bit shift count still takes some gate-delays. – Peter Cordes Apr 11 '21 at 02:51
  • @PeterCordes also it can get 4/clock throughput on icl and newer. re: "Intel slower for the expected reason of being an extra input to do the merging requiring an extra uop." Any chance you have a link to further reading for that? I know about the length-changing prefix, but that's just FE AFAIK. – Noah Apr 11 '21 at 03:20
  • @Noah: Oh right, the uops.info search I had open from earlier included `is` (scaled index) so only the 2/clock results were coming up. >.< Yeah, Ice Lake has totally different rules; it seems only a scale factor makes a slow LEA, not even 3-component LEA is slow. – Peter Cordes Apr 11 '21 at 03:33
  • @Noah: `lea edx, [rax + rcx]` has 2 input registers and one pure-output. `lea dx, [rax + rcx]` has 3 input registers and one output: it reads RDX, merges a new low 16, and writes RDX. Intel *might* have crammed this into a single uop if 16-bit code was relevant, but it isn't. They only went to the trouble of handling 3-input integer uops for useful stuff like `cmov` and `adc` (which are always 3-input, not just with a special operand-size). – Peter Cordes Apr 11 '21 at 03:35
  • As @harold says, implementations of `LEA` typically use a 3-input adder, which is a carry-save adder providing 3:2 compression followed by a carry-propagate adder. At least that was the design used in the Athlon processor (first hand knowledge; I was on the FPU design team for that CPU). The shifts are not done by a general-purpose barrel shifter but a 4:1 mux (multiplexer), which is much simpler and faster. – njuffa Apr 11 '21 at 04:49