LEA is just a shift-and-add instruction, with a 2-bit shift count as part of the machine encoding. That's much cheaper to build in hardware than a full 64-bit multiplier, which is why CPUs can easily have multiple execution units that handle LEA uops, for better than 1/clock throughput.
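For instance, here's a sketch (my example, not from the question) of what a single LEA can compute; the scale is the 2-bit shift count mentioned above:

```
# LEA evaluates base + index*scale + disp entirely in an ALU;
# the scale (1, 2, 4, or 8) is encoded as a 2-bit shift count in the SIB byte.
lea 4(%rdi, %rsi, 8), %rax    # rax = rdi + rsi*8 + 4, no memory access
```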
Note that LEA latency is 1 cycle only for simple-enough addressing modes (before Ice Lake). On Intel SnB-family CPUs there aren't any uops with 2-cycle latency, and LEA with 3 components (two `+` operations) has 3-cycle latency. Apparently Intel couldn't, or didn't, fit 2 additions (or a 3:2 reduction and one addition) within the gate-delay budget of a single ALU cycle until Ice Lake.
But yes, simpler LEAs like the one in the question (with no displacement) have 1-cycle latency and 2/clock throughput on SnB-family, with "slow" LEAs only running on port 1 (the only execution port on SnB-family that can run integer uops with latency other than 1).
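Concretely, an illustrative pair matching that SnB-family breakdown (register choices are mine):

```
lea (%rax, %rcx, 4), %rdx     # 2 components: 1c latency, ports 1/5, 2/clock
lea 8(%rax, %rcx, 4), %rdx    # 3 components (base+index+disp): 3c latency, port 1 only
```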
On Ice Lake, LEA is always 1c latency, 1 uop: 2/clock throughput for addressing modes including a scaled index (shift count != 0), or 4/clock otherwise. (Even for 3-component LEAs like `lea 1(%rax, %rcx), %edx` that would be "slow" on Skylake or Zen.)
On AMD, `lea` is 1 or 2-cycle latency, with a similar throughput reduction (fewer ports) for slow LEAs. And the conditions for being fast are more restrictive: a scale factor other than 1 makes it slow. But Zen still has 2 execution units that can handle "slow" LEAs, and 4 for fast LEAs.
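An illustrative pair for Zen (my examples; see the tables linked below for exact numbers):

```
lea (%rdi, %rsi), %rax        # scale = 1: "fast" on Zen, 1c latency, 4 ports
lea (%rdi, %rsi, 2), %rax     # scaled index: "slow" on Zen, 2c latency, 2 ports
```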
(Sources: https://uops.info/ and https://agner.org/optimize/)
> Is `imulq $9, %rax` faster than doing 2 commands, one to shift left and the other to add the original `%rax` (which we could previously save in a register)?
`imul $9, %rax` is 1 uop, 3c latency, 1/clock throughput on AMD since Zen and on Intel since Nehalem (https://uops.info/). Higher latency on older CPUs, especially for 64-bit operand-size.
`shl $3, %rax` / `add %rcx, %rax` is 2 uops for the front-end, but only has 2-cycle latency. (And probably an extra `mov` somewhere before that, for a 3rd uop; see the sketch below.)
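A sketch of the full sequence, assuming the original value gets copied to `%rcx` first (register choices are mine):

```
mov %rax, %rcx     # save a copy of a (the extra uop; often zero latency via mov-elimination)
shl $3, %rax       # rax = a*8
add %rcx, %rax     # rax = a*8 + a = a*9; critical path shl -> add = 2 cycles
```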
However, any decent compiler would use `lea (%rax, %rax, 8), %rax` instead (a*9 = a + a*8): 1 uop, 1c latency on Intel, 2/clock throughput, so it's not worse in any way and better in many. (Or at worst, 2-cycle latency on AMD because of the scaled index, but that's still better than `imul`.)
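The same trick generalizes to other small constants. These are illustrative sequences of the kind compilers pick, not output copied from any particular compiler:

```
lea (%rdi, %rdi, 8), %rax    # a*9  = a + a*8: 1 uop
lea (%rdi, %rdi, 4), %rax    # a*5  = a + a*4
add %rax, %rax               # a*10 = (a*5)*2: 2 uops, 2c latency on Intel, still no worse than imul
```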
When you're looking at a single instruction or short sequence, performance isn't one-dimensional; there are 3 dimensions: front-end uops, back-end ports, and latency of the critical path. There's no single-number cost you can add up across instructions to find out how long a block of instructions will take; the whole point of superscalar out-of-order execution is to find instruction-level parallelism, whose existence depends on how instructions use each other's results. (But sometimes you can say one sequence is at least as good as another in every way, if it's the same or better in all 3 dimensions across all existing CPUs.)
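To make that concrete, compare two hypothetical sequences built from the same instruction:

```
# Latency-bound: each imul reads the previous result, so the chain
# takes 3+3+3 = 9 cycles no matter how wide the CPU is.
imul $9, %rax
imul $9, %rax
imul $9, %rax

# Throughput-bound: independent results, so the imuls issue at 1/clock
# and overlap; all three finish in about 5 cycles.
imul $9, %rax
imul $9, %rcx
imul $9, %rdx
```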