Let's consider the following function:
#include <stdint.h>
uint64_t foo(uint64_t x) { return x * 3; }
If I were to write it by hand in assembly, I'd do:
.global foo
.text
foo:
imul $0x3, %rdi, %rax
ret
On the other hand, the compiler generates two additions with -O0:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 89 7d f8 mov %rdi,-0x8(%rbp)
8: 48 8b 55 f8 mov -0x8(%rbp),%rdx
c: 48 89 d0 mov %rdx,%rax
f: 48 01 c0 add %rax,%rax
12: 48 01 d0 add %rdx,%rax
15: 5d pop %rbp
16: c3 retq
or a single lea with -O2:
0000000000000000 <foo>:
0: 48 8d 04 7f lea (%rdi,%rdi,2),%rax
4: c3 retq
Why? My assumption is that every assembly instruction takes one processor clock cycle, so my version should run in 2 clock cycles (it has two instructions). Under the same assumption, the -O0 version needs 4 cycles for the addition, because it could be rewritten to
mov %rdi,%rax
add %rax,%rax
add %rdi,%rax
retq
and the lea version should take two cycles as well.
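To make the comparison concrete, here is a minimal C sketch of the three sequences above. It only shows that they compute the same value, including on wraparound. It says nothing about how many cycles each one takes. The names triple_mul, triple_add, triple_lea and triple_forms_agree are hypothetical helpers introduced for illustration, not anything the compiler emits.

```c
#include <stdint.h>

/* Mirrors the hand-written imul version: one multiply by the constant 3. */
static uint64_t triple_mul(uint64_t x) { return x * 3; }

/* Mirrors the -O0 output: copy x, double the copy, then add x once more. */
static uint64_t triple_add(uint64_t x) {
    uint64_t r = x;   /* mov %rdx,%rax */
    r += r;           /* add %rax,%rax */
    r += x;           /* add %rdx,%rax */
    return r;
}

/* Mirrors the -O2 output: lea (%rdi,%rdi,2) computes base + index * 2. */
static uint64_t triple_lea(uint64_t x) { return x + x * 2; }

/* Returns 1 when all three forms agree for the given x, 0 otherwise.
   Unsigned arithmetic wraps modulo 2^64, so this also holds on overflow. */
static int triple_forms_agree(uint64_t x) {
    return triple_mul(x) == triple_add(x) && triple_add(x) == triple_lea(x);
}
```

Calling triple_forms_agree with values such as 0, 7 or UINT64_MAX should return 1 in each case, so the only open question is which instruction sequence is cheapest, not which is correct.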