Those instructions aren't equivalent. For LEA, `rax` is a pure output. For your hoped-for `add`, it's `rax += rsi`, so the compiler would have to `mov %rdi, %rax` first. That's less efficient, so compilers don't do that.

`lea` is a totally normal way for compilers to implement `dst = src1 + src2`, saving a `mov` instruction. In general, don't expect C operators to compile to the instructions named after them, especially small left-shifts and add, or multiply by 3, 5, or 9, because those are prime targets for optimization with LEA. e.g. `lea (%rsi, %rsi, 2), %rax` implements `result = y*3`. See *Using LEA on values that aren't addresses / pointers?* for more. LEA is also useful to avoid destroying either of the inputs, if they're both needed later.
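As a concrete example (a minimal sketch; the function name is mine), both gcc and clang at `-O2` turn this multiply into a single `lea`:

```c
// Multiply by 3 fits LEA's base + index*scale addressing form,
// so compilers emit:  lea (%rdi,%rdi,2), %rax    (i.e. y + y*2)
long times3(long y) {
    return y * 3;
}
```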
Assuming you meant `t3` to be the same variable as `temp3`, clang does compile the way you were expecting: it does a better job of register allocation, so it can use a shorter and more efficient `add` instruction without any extra `mov` instructions, instead of needing `lea`.

Clang's register allocation lets it use a plain `add` instead of `lea` for the last instruction (Godbolt). This saves code-size (the indexed addressing mode makes the LEA encoding longer), and `add` has slightly better throughput than LEA on most CPUs, like 4/clock instead of 2/clock.
Clang also optimized the shifts into `andl $1, %eax` / `negq %rax` to create the 0 or -1 result of that arithmetic right shift (a bit-broadcast). It also used 32-bit operand-size for the first few steps, because the shifts throw away all but the low bit of `temp1`.
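For reference, here's a C source consistent with both compilers' output below. It's a reconstruction from the asm, so the exact expressions are inferred (and note that left-shifting a signed value into the sign bit is technically undefined in ISO C, even though it matches the asm):

```c
long someFunc(long x, long y, long z) {
    long temp1 = (x - z) * x;
    long temp2 = (temp1 << 63) >> 63;  // arithmetic shift broadcasts the low bit: 0 or -1
    long temp3 = temp2 ^ x;            // x or ~x
    return temp3 + y;
}
```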
```
# side by side comparison, like the Godbolt diff pane
clang:                    |  gcc:
    movl  %edi, %eax      |      movq  %rdi, %rax
    subl  %edx, %eax      |      subq  %rdx, %rdi
    imull %edi, %eax      |      imulq %rax, %rdi          # temp1
    andl  $1, %eax        |      salq  $63, %rdi
    negq  %rax            |      sarq  $63, %rdi           # temp2
    xorq  %rdi, %rax      |      xorq  %rax, %rdi          # temp3
    addq  %rsi, %rax      |      leaq  (%rdi,%rsi), %rax   # answer
    retq                  |      ret
```
Notice that clang chose `imul %edi, %eax` (into RAX) but GCC chose to multiply into RDI. That's the difference in register allocation that leads to GCC needing an `lea` at the end instead of an `add`.

Compilers sometimes even get stuck with an extra `mov` instruction at the end of a small function when they make poor choices like this, if the last operation wasn't something like addition that can be done with `lea` as a non-destructive op-and-copy. These are missed-optimization bugs; you can report them on GCC's bugzilla.
### Other missed optimizations
GCC and clang could have optimized by using `and` instead of `imul` to set the low bit only if both inputs are odd.

Also, since only the low bit of the `sub` output matters, XOR (add-without-carry) would have worked, or even addition! (Odd ± even = odd, even ± even = even, and odd ± odd = even, so `x+z` and `x-z` always have the same low bit.) That would have allowed an `lea` instead of `mov`/`sub` as the first instruction:
```
    lea   (%rdi,%rdx), %eax    # x+z: same low bit as x-z (z is in RDX)
    and   %edi, %eax           # low bit matches (x-z)*x
    andl  $1, %eax             # keep only the low bit
    negq  %rax                 # temp2
```
Let's make a truth table for the low bits of `x` and `z` to see how this shakes out if we want to optimize more / differently:
```
# truth table for the low bit: the input to the shifts that broadcast it to all bits
x&1 | z&1 | x-z = x^z | x*(x-z) = x & (x-z)
 0  |  0  |     0     |        0
 0  |  1  |     1     |        0
 1  |  0  |     1     |        1
 1  |  1  |     0     |        0
                         last column = x & (~z) = BMI1 andn
```
So `temp2 = (x^z) & x & 1 ? -1 : 0`. But also `temp2 = -((x & ~z) & 1)`.
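As a sanity check (my own snippet, not part of the original answer), brute-forcing all four low-bit combinations confirms those identities:

```c
#include <assert.h>
#include <stdio.h>

int main(void) {
    for (long x = 0; x < 2; x++) {
        for (long z = 0; z < 2; z++) {
            long via_imul = ((x - z) * x) & 1;  // what the original code computes
            long via_xor  = ((x ^ z) & x) & 1;  // sub replaced by xor
            long via_andn = (x & ~z) & 1;       // one BMI1 andn (plus the & 1)
            assert(via_imul == via_xor && via_xor == via_andn);
        }
    }
    puts("all identities hold for the low bit");
    return 0;
}
```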
We can rearrange that to `-((x&1) & ~z)`, which lets us start with `not z` and `and $1, x` in parallel, for better ILP. Or if `z` might be ready first, we could do operations on it and shorten the critical path from `x` -> `answer`, at the expense of `z`.
Or with a BMI1 `andn` instruction, which does `(~z) & x`, we can do this in one instruction (plus another to isolate the low bit).
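In C you don't even need an intrinsic for that: with `-mbmi`, gcc and clang can pattern-match `x & ~z` into a single `andn`. A hedged sketch (function name is mine):

```c
// With -O2 -mbmi, the x & ~z part can become one andn instruction.
long low_bit_of_temp1(long x, long z) {
    return (x & ~z) & 1;
}
```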
I think this function has the same behaviour for every possible input, so compilers could have emitted it from your source code. This is one possibility you could wish your compiler had emitted:
```
# hand-optimized
# long someFunc(long x, long y, long z)
someFunc:
    not   %edx                 # ~z
    and   $1, %edx
    and   %edi, %edx           # (x&1) & ~z = low bit of temp1
    neg   %rdx                 # temp2 = 0 or -1
    xor   %rdi, %rdx           # temp3 = x or ~x
    lea   (%rsi, %rdx), %rax   # answer = y + temp3
    ret
```
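In C terms (my formulation of the same idea, not the OP's source), that hand-optimized version corresponds to:

```c
long someFunc_opt(long x, long y, long z) {
    long temp2 = -((x & ~z) & 1);  // 0 or -1, matching the andn/neg idea
    return (temp2 ^ x) + y;        // conditional NOT of x, then add y
}
```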
So there's still no ILP, unless `z` is ready before `x` and/or `y`. Using an extra `mov` instruction, we could do `x&1` in parallel with `not z`.
Possibly you could do something with `test`/`setz` or `cmov`, but IDK if that would beat `lea`/`and` (temp1) + `and`/`neg` (temp2) + `xor` + `add`.
I haven't looked into optimizing the final `xor` and `add`, but note that `temp3` is basically a conditional NOT of `x`. You could maybe improve latency at the expense of throughput by calculating both ways at once and selecting between them with `cmov`, possibly using the 2's complement identity that `-x - 1 = ~x`. Maybe improve ILP / latency by doing `x+y` and then correcting it with something that depends on the x and z condition? Since we can't subtract using LEA, it seems best to just NOT and ADD.
```
# return y + x  or  y + (~x), according to the condition on x and z
someFunc:
    lea    (%rsi, %rdi), %rax   # y + x
    andn   %edi, %edx, %ecx     # ecx = x & (~z)
    not    %rdi                 # ~x
    add    %rsi, %rdi           # y + (~x)
    test   $1, %cl
    cmovnz %rdi, %rax           # select between y+x and y+~x
    retq
```
This has more ILP, but it needs BMI1 `andn` to still be only 6 (single-uop) instructions. (Broadwell and later have single-uop CMOV; on earlier Intel it's 2 uops.) The other version of the function could be 5 uops using BMI `andn`.
In this version, the first 3 instructions can all run in the first cycle, assuming x, y, and z are all ready. Then in the 2nd cycle, ADD and TEST can both run. In the 3rd cycle, CMOV runs, taking its integer inputs from LEA and ADD, and its flag input from TEST. So the total latency from x->answer, y->answer, or z->answer is 3 cycles in this version (assuming single-uop / single-cycle cmov). That's great if it's on the critical path, but not very relevant if it's part of an independent dependency chain where throughput is all that matters.
That's vs. 5 cycles (with `andn`) or 6 cycles (without) for the previous attempt, or even worse for the compiler output using `imul` instead of `and` (3-cycle latency just for that instruction).
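To double-check that all of these rewrites agree, here's a quick randomized self-test (my harness; the "original" version is reconstructed from the asm above, and its signed shift is technically undefined behaviour in ISO C even though it matches the asm):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// as reconstructed from the compiler output
static long orig(long x, long y, long z) {
    long temp2 = (((x - z) * x) << 63) >> 63;  // 0 or -1
    return (temp2 ^ x) + y;
}

// branchless andn/neg/xor/lea version
static long opt(long x, long y, long z) {
    long temp2 = -((x & ~z) & 1);
    return (temp2 ^ x) + y;
}

// cmov-style version: select between y + x and y + ~x
static long cmov_style(long x, long y, long z) {
    return ((x & ~z) & 1) ? y + ~x : y + x;
}

int main(void) {
    srand(1);
    for (int i = 0; i < 1000000; i++) {
        long x = ((long)rand() << 32) ^ rand();
        long y = ((long)rand() << 32) ^ rand();
        long z = ((long)rand() << 32) ^ rand();
        assert(orig(x, y, z) == opt(x, y, z));
        assert(orig(x, y, z) == cmov_style(x, y, z));
    }
    puts("all versions agree");
    return 0;
}
```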