re-initialization of %rax at cleanup.
Both compilers zero EAX because in C++ (and C99) int main() has an implicit return 0; at the bottom of the function.
gcc only looks for the xor-zeroing peephole optimization at -O2 and higher, but you're compiling with the default gcc -O0 (debug mode: no optimization, and variables aren't even kept in registers across statements). ICC (and clang) use xor-zeroing even at -O0.
Both mov $0, %eax and xor %eax,%eax "initialize" RAX, i.e. break any dependency on the old value. mov $0, %eax is the inefficient way.
(Or without -g, ICC may be defaulting to -O2, but that doesn't change its prologue / epilogue choices: it still sets fast-math mode and calls a special Intel init function at the top of main. You can more easily look at compiler asm output on the Godbolt compiler explorer. It implicitly passes -g, so ICC there definitely defaults to -O0.)
push %rbp
mov %rsp,%rbp
This makes a stack frame with RBP. GCC enables -fomit-frame-pointer at -O1 and higher, so gcc won't waste instructions on that in a normal function.
ICC still does make a stack frame in main because it wants to align the stack to 128 bytes. (A stack frame is the easiest way to restore the stack pointer at the end of main after RSP has been moved by an unknown offset, so main can return.)
# ICC stack over-alignment code:
and $0xffffffffffffff80,%rsp # round RSP down to the next multiple of 128
sub $0x80,%rsp # and reserve 128 bytes
# missed optimization: add $-0x80, %rsp could use an imm8 instead of imm32
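The two instructions above are plain integer arithmetic on RSP. As a minimal sketch (the function names are made up for illustration, and RSP is modelled as an ordinary 64-bit integer), this C++ mirrors the and / sub pair and also shows why add $-0x80 fits in a sign-extended 8-bit immediate while sub $0x80 does not:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model only: these helpers are hypothetical, not anything ICC emits.

// and $0xffffffffffffff80, %rsp : clear the low 7 bits, i.e. round down
// to a multiple of 128 (an already-aligned value is left unchanged).
std::uint64_t align_down_128(std::uint64_t rsp) {
    return rsp & ~std::uint64_t{0x7F};
}

// and + sub $0x80, %rsp : align, then reserve 128 bytes below that address.
std::uint64_t icc_main_prologue_rsp(std::uint64_t rsp) {
    const std::uint64_t aligned = align_down_128(rsp);
    assert(aligned >= 0x80 && "model assumes no wrap-around");
    return aligned - 0x80;
}

// x86 sign-extends an 8-bit immediate, so only [-128, 127] fits in an imm8.
bool fits_in_imm8(std::int64_t value) {
    return value >= -128 && value <= 127;
}
```

With a made-up entry value, icc_main_prologue_rsp(0x7fffffffe4c8) gives 0x7fffffffe400. fits_in_imm8(0x80) is false but fits_in_imm8(-0x80) is true, which is the reason the add form can use the shorter encoding.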
I don't know why ICC aligns the stack in main. 128 is the size of the red-zone in the x86-64 SysV ABI, but that might be coincidence. It would mean that stuff inlined into main wouldn't have to worry about page-crossing for locals in the red-zone. (Cache-line size is 64 bytes, page size is 4 KiB.)
The x86-64 System V ABI only guarantees 16-byte stack alignment, so future function calls won't preserve the 128-byte alignment. (GCC doesn't align the stack because main is already called with 16-byte stack alignment.)
If you'd picked any other function name instead of main, you wouldn't see much weird stuff.
sub $0x10,%rsp
GCC is reserving 16 bytes of stack space for int x,y,z (and keeping the stack 16-byte aligned after push rbp). int takes 4 bytes in the x86-64 SysV ABI. GCC is storing them in memory because you compiled with optimization disabled.
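The 16 comes from rounding 3 * 4 = 12 bytes up to the next multiple of 16. A rough sketch of that arithmetic follows; it is a simplification with hypothetical helper names, not GCC's real frame-layout logic:

```cpp
#include <cassert>
#include <cstddef>

// sizeof(int) in the x86-64 SysV ABI, hard-coded so the sketch does not
// depend on the machine it is compiled on.
constexpr std::size_t kIntSize = 4;

// Round size up to a multiple of alignment (which must be a power of two).
std::size_t round_up(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

// Space for three ints, padded so RSP stays 16-byte aligned.
std::size_t frame_bytes_for_three_ints() {
    return round_up(3 * kIntSize, 16);
}
```

frame_bytes_for_three_ints() returns 16, matching the sub $0x10,%rsp in the gcc output.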
If you'd compiled with -O2, g++ would have kept the variables in registers and only used sub $8, %rsp to align the stack by 16 after function entry (instead of pushing anything).
Or with -mtune=haswell or something, I think recent gcc might push %rax instead of using sub to align the stack.
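To see why an 8-byte adjustment is enough, recall that the caller has RSP 16-byte aligned before the call, and the call pushes an 8-byte return address. A small model of that bookkeeping, with illustrative helper names and a made-up stack address:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers that track only the value of RSP.

// RSP as seen by the first instruction of a function: the caller's
// 16-byte-aligned RSP minus the 8-byte return address pushed by call.
std::uint64_t rsp_at_function_entry(std::uint64_t aligned_rsp_before_call) {
    assert(aligned_rsp_before_call % 16 == 0);
    return aligned_rsp_before_call - 8;
}

// push %rbp or push %rax: either one moves RSP down by 8 bytes.
std::uint64_t after_push(std::uint64_t rsp) { return rsp - 8; }

// sub $n, %rsp
std::uint64_t after_sub(std::uint64_t rsp, std::uint64_t n) { return rsp - n; }

bool is_16_byte_aligned(std::uint64_t rsp) { return rsp % 16 == 0; }
```

Starting from rsp_at_function_entry(0x7fffffffe4d0), RSP is misaligned by 8. A single after_push, or after_sub with n = 8, restores 16-byte alignment, and push followed by sub $0x10 (the -O0 sequence) keeps it.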
leave vs. mov %rbp,%rsp / pop %rbp.
GCC prefers leave for tearing down a stack frame if RSP isn't already pointing to the saved RBP value. (Otherwise it just uses pop rbp.)
leave is 3 uops, according to Agner Fog's testing on Intel CPUs, but that might include a stack-sync uop if tested back-to-back. I haven't checked myself. mov/pop is only 2 uops total.
leave looks like a good optimization choice to me; IDK if Intel's tuning choice here (mov/pop in ICC's output) is left over from older CPUs where multi-uop instructions could more easily cause decode problems, or if they've actually tested and found that 2 separate instructions are best on Haswell/Skylake.
At the end, what is the nopl 0x0(%rax)
? ... I thought retq was the last instruction of main.
RET is the last instruction in main. The long-NOP is part of the padding between functions, from the compiler's .p2align 4 directive.
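.p2align 4 asks the assembler to pad up to the next 2^4 = 16-byte boundary. A minimal sketch of how many padding bytes that takes, using a hypothetical helper and ignoring the directive's optional fill-value and max-skip arguments:

```cpp
#include <cassert>
#include <cstdint>

// Bytes of padding needed so that address becomes a multiple of
// 2^log2_alignment. Returns 0 when the address is already aligned.
std::uint64_t p2align_padding(std::uint64_t address, unsigned log2_alignment) {
    assert(log2_alignment < 64);
    const std::uint64_t alignment = std::uint64_t{1} << log2_alignment;
    return (alignment - address % alignment) % alignment;
}
```

For example, if the byte after ret landed at the made-up address 0x4005c9, p2align_padding(0x4005c9, 4) is 7, so the assembler would fill 7 bytes, typically with one long NOP, before the next function starts.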