
icc and gcc produce slightly different assembly, in particular for the stack initialization and the function (main) cleanup.

Initialization:

icc:

push   %rbp
mov    %rsp,%rbp
and    $0xffffffffffffff80,%rsp
sub    $0x80,%rsp

gcc:

push   %rbp
mov    %rsp,%rbp
sub    $0x10,%rsp

Finish:

icc:

xor    %eax,%eax
mov    %rbp,%rsp
pop    %rbp
retq
nopl   0x0(%rax)

gcc:

mov    $0x0,%eax
leaveq
retq

Could someone explain these differences? I can understand the gcc code, but the icc output is more mysterious. Are the differences meaningful or arbitrary? (e.g., the re-initialization of %rax at cleanup).

C++ code and Makefile (for repro):

#include <iostream>
using namespace std;
int main()
{
    int x = 3;
    x *= x;
    cout << x << endl;
    x = 3;
    int y = x + x;
    int z = x + 3;
    cout << (x * y) + (z / z) << endl;
}

and then the Makefile:

build: code.cpp
    icpc code.cpp -o code_i
    objdump -d code_i > code.icc
    g++ code.cpp -o code_g
    objdump -d code_g > code.gcc
    diff code.icc code.gcc > code.diff

Full files: on github

Soleil
  • They are different compilers. They produce different machine code. Why did you expect the same machine code from different compilers? (what logic/assumption gave you that idea?) Because there's no reasonable reason for that. It would be sometimes handy to have only one possible solution of compilation, but that's not true with C++ language, there are millions of possible correct machine codes executing the task written in C++ source. Also you didn't switch on optimization, so there's no need to worry about things like re-initialization, etc. Debug is about fast compile times, not pretty code. – Ped7g May 22 '18 at 20:59
  • I'm not expecting anything; only similarities and differences. I'm studying what a compiler does, and writing my own. Even if the code might be a bit different, there can be embeddings / {iso, homeo,?}morphings. – Soleil May 22 '18 at 21:05
  • Not really, the possible ways how to compile certain C++ expressions are so vast, that usually with any non-trivial code the machine code between compilers will differ significantly. Check for example code in [this question](https://stackoverflow.com/a/50326207/4271923). Not only gcc vs clang use different instructions like some kind of morphing, but they end with completely different algorithm. If you want to write a compiler, or understand how they work, you should first read a book about how compilers are constructed, then check some source code of smaller compiler. Checking output is last. – Ped7g May 22 '18 at 21:10
  • Anyway, the compiler must produce machine code, which produces correct effects observable on C++ abstract machine definition, as defined by C++ language. There's nothing about how the machine code should be, for example your code is compiled as `cout << 9 << endl << 19 << endl;` (with optimizations `-O3`), because there's no need to calculate those values with arithmetic, that's not observable through C++ abstract machine, only the result outputted to `cout` is (and not calculating them at runtime is much better for performance). – Ped7g May 22 '18 at 21:15
  • I meant morphism, not morphings. @Ped7g are you telling the Rice theorem in a certain way? It's not the point here. I'm having 2 implementations of a certain algorithm, and I'm saying there is a "link" between them. Actually when you look at the icc's code, there's no more computation, it's mostly displaying the result. Also, one last thing, when there is no explicit option, optimization is `-O2`. – Soleil May 22 '18 at 21:24
  • The default optimization level is `-O0` for both compilers (de-optimize for debugging). That's why gcc is making a stack frame with RBP at all. (ICC does it anyway, and still aligns the stack by 128; IDK why). And BTW, you can diff compiler asm output more easily on the Godbolt compiler explorer (https://godbolt.org/g/7sWU7V); it has a "diff view" feature. – Peter Cordes May 22 '18 at 21:29
  • Actually it's `-O0` if `-g` is set; if not, the default optimization is `-O2` for icc https://software.intel.com/en-us/forums/intel-c-compiler/topic/290820 – Soleil May 22 '18 at 21:37
  • Godbolt implicitly passes `-g`, so if that forum post is still accurate, that explains why ICC appears to default to `-O0` on Godbolt. gcc *definitely* defaults to `-O0`, so your comparison is totally bogus, then. GCC uses xor-zeroing at `-O2` and higher. – Peter Cordes May 22 '18 at 21:39
  • You have 2 implementations of desired observable behaviour of C++ abstract machine. Not 2 implementations of the same algorithm. The compiler is free (and will!) to modify/replace your algorithm, as long as the observable end result is same. You can of course check diffs, but you put into question parts which are quite pointless (from asm point of view) and usually don't happen in release builds or hand written asm at all (unless you force the compiler to even bother with stack-frame pointer in `rbp`), it's sort of housekeeping/helper code and compiler vendors have their own needs over those. – Ped7g May 22 '18 at 22:56
  • @Ped7g Interesting. Programs and machines actually do simulation. Whatever the way. I guess it's a huge difference between humans and machine, since with humans and life, the way we do things matters more than the result; because the way we do it changes us, but machines remains mostly the same. With FPGA and adaptive FPGA (Xilinx ACAP) this might change ! – Soleil May 22 '18 at 23:07

1 Answer


re-initialization of %rax at cleanup.

Both compilers zero EAX because in C++ (and C99) int main() has an implicit return 0; at the bottom of the function.

gcc only looks for the xor-zeroing peephole optimization at -O2 and higher, but you're compiling with the default gcc -O0 (debug mode / no optimization / don't even keep variables in registers across statements.) ICC (and clang) use xor-zeroing even at -O0.

Both mov $0, %eax and xor %eax,%eax "initialize" RAX, i.e. break any dependency on the old value. mov $0, %eax is the inefficient way.
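The difference is visible in the machine code itself (an illustrative fragment, not copied from either compiler's output; the byte listings come from the instruction encodings):

31 c0                    xor    %eax,%eax    # 2 bytes: the recognized zeroing idiom
b8 00 00 00 00           mov    $0x0,%eax    # 5 bytes: same architectural effect (writing EAX
                                             #          also zeros the upper half of RAX)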

(Or without -g, ICC may be defaulting to -O2, but that doesn't change its prologue / epilogue choices. It still sets fast-math mode and calls special Intel init functions at the top of main. You can more easily look at compiler asm output on the Godbolt compiler explorer. It implicitly passes -g, so ICC there definitely defaults to -O0.)


push   %rbp
mov    %rsp,%rbp

This makes a stack frame with RBP. GCC -O1 and higher enables -fomit-frame-pointer, so gcc won't waste instructions on that in a normal function.

ICC still does make a stack frame in main because it wants to align the stack by 128. (And a stack frame is the easiest way to restore RSP at the end of main after rounding it down by an unknown offset, so main can return.)


# ICC stack over-alignment code:
and   $0xffffffffffffff80,%rsp   # round RSP down to the next multiple of 128
sub   $0x80,%rsp                 # and reserve 128 bytes
      # missed optimization: add $-0x80, %rsp could use an imm8 instead of imm32

I don't know why ICC aligns the stack in main. 128 is the size of the red-zone in the x86-64 SysV ABI, but that might be coincidence. It would mean that stuff inlined into main wouldn't have to worry about page-crossing for locals in the red-zone. (Cache-line size is 64B, page size is 4kiB).
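As a worked example of what that instruction pair does (the incoming RSP value here is made up):

# on entry to this code:          %rsp = 0x7fffffffe468 (hypothetical)
and    $0xffffffffffffff80,%rsp   # clears the low 7 bits -> 0x7fffffffe400 (a multiple of 128)
sub    $0x80,%rsp                 # reserves 128 bytes    -> 0x7fffffffe380 (still 128-byte aligned)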

The x86-64 System V ABI only guarantees 16-byte stack alignment, so future function calls won't preserve the 128-byte alignment. (GCC doesn't align the stack because main is already called with 16-byte stack alignment.)

If you'd picked any other function name instead of main, you wouldn't see much weird stuff.


sub    $0x10,%rsp

GCC is reserving 16 bytes of stack space for int x,y,z (and keeping the stack 16-byte aligned after push rbp). int takes 4 bytes in the x86-64 SysV ABI. GCC is storing them in memory because you compiled with optimization disabled.
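The -O0 body then round-trips every access through those stack slots; a rough sketch of the flavour (hypothetical offsets, not the exact gcc output):

movl   $0x3,-0x4(%rbp)      # int x = 3;   x lives in the 16 bytes below RBP
mov    -0x4(%rbp),%eax      # x *= x;      reload x from its stack slot,
imul   -0x4(%rbp),%eax      #              multiply,
mov    %eax,-0x4(%rbp)      #              and store the result straight back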

If you'd compiled with -O2, g++ would have kept the variables in registers and only used sub $8, %rsp to align the stack by 16 after function entry (instead of pushing anything).
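Roughly this shape (a hypothetical sketch, not actual g++ -O2 output; at -O2 the arithmetic is constant-folded, so only the cout calls remain):

main:
    sub    $0x8,%rsp        # re-align RSP to 16 (the call into main pushed an 8-byte return address)
    # ... the two cout/operator<< calls, with 9 and 19 as immediate constants ...
    xor    %eax,%eax        # implicit return 0
    add    $0x8,%rsp
    retq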

Or with -mtune=haswell or something, I think recent gcc might push %rax instead of using sub to align the stack.
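Presumably the attraction is code size; the pushed value is never read, it only moves RSP down by 8 (byte counts from the encodings):

push   %rax                 # 50             1 byte:  RSP -= 8, the stored value is ignored
sub    $0x8,%rsp            # 48 83 ec 08    4 bytes: the same RSP adjustment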


leave vs. mov %rbp,%rsp / pop %rbp.

GCC prefers leave for tearing down a stack frame if RSP isn't already pointing to the saved RBP value. (Otherwise it just uses pop rbp).

leave is 3 uops, according to Agner Fog's testing on Intel CPUs, but that might include a stack-sync uop if tested back-to-back. I haven't checked myself. mov/pop is only 2 uops total.

leave looks like a good optimization choice to me; IDK if Intel's tuning choice here is left over from older CPUs where multi-uop instructions could more easily cause decode problems, or if they've actually tested and found that 2 separate instructions are best on Haswell/Skylake.
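For reference, the two epilogues side by side; leave is architecturally defined as mov %rbp,%rsp followed by pop %rbp, so they do the same thing, and the byte counts come from the encodings:

c9                       leaveq              # gcc: 1 byte (see the uop discussion above)
c3                       retq

48 89 ec                 mov    %rbp,%rsp    # icc: 3 bytes
5d                       pop    %rbp         #      1 byte
c3                       retq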


At the end, what is the nopl 0x0(%rax)? ... I thought retq was the last instruction of main.

RET is the last instruction in main. The long-NOP is part of the padding between functions, from the compiler's .p2align 4 directive.
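The padding after retq decodes like this, for example (the exact mix of NOP forms depends on how many bytes the assembler needs to fill; the byte listings are the standard long-NOP encodings, and none of these bytes are ever executed):

retq                              # c3 : the real end of main
nopl   0x0(%rax)                  # 0f 1f 40 00                   : 4-byte NOP
nopw   %cs:0x0(%rax,%rax,1)       # 66 2e 0f 1f 84 00 00 00 00 00 : 10-byte NOP, for bigger gaps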

Peter Cordes
  • Many **thanks** for commenting the code. At the end, what is the `nopl 0x0(%rax)` ? to avoid a reset ? why `xor eax` and `nopl rax` ? And why after `retq` ? I thought `retq` was the last instruction of `main`. (https://stackoverflow.com/questions/12559475/what-does-nopl-do-in-x86-system) – Soleil May 22 '18 at 22:32
  • What is the difference between 1) `and $0xffffffffffffff80,%rsp; sub $0x80,%rsp` and 2) `sub $0x10,%rsp` ? – Soleil May 22 '18 at 22:34
  • @Soleil: Updated my answer. What are you asking about with XOR vs. NOP? Note that it's `nopl 0(%rax)`, i.e. it's using a long addressing mode to take up more space, not a register operand. If you mean why EAX vs. RAX, neither is using a REX or address-size prefix, and the default operand-size is 32, while the default address-size is 64, in long mode. – Peter Cordes May 22 '18 at 22:45
  • I understand that `xor %rax,%rax` then `retq` is equivalent to `return 0;` in c. Then, why doing `xor` on `%eax` (by icc), then `retq` ? So, actually the `nopl 0x0(%rax)` is just a "padding" between `main()` and the next function in the assembly ? In that case, why mentioning `%rax` ? I saw many other functions finishing with `nopl 0x0(%rax,%rax,1); nopw %cs:0x0(%rax,%rax,1)`: what is it ? I saw after the last %cs on the left what I think is the padding: `4007bd: 00 00 00` but there is no instruction associated. Is that the "consequence of" or "another way to say" `nopl 0x0(%rax)` ? – Soleil May 22 '18 at 23:01
  • @Soleil: [Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?](https://stackoverflow.com/q/11177137). I linked [What is the best way to set a register to zero in x86 assembly: xor, mov or and?](https://stackoverflow.com/q/33666617) early in this answer, which goes into all the detail you could even want about xor-zeroing, and why `xor %eax,%eax` is the same as `xor %rax,%rax` but more efficient. (GCC zeros registers that way too, if you don't disable optimizations.) – Peter Cordes May 22 '18 at 23:04
  • @Soleil `xor eax,eax` is shorter way of zeroing `rax`. There are various opcodes (and combinations of prefixes+opcodes) for `nop`-like instructions, so depending on how many bytes of padding is required, ICC opts for one of them. Actually it's a bit pointless exercise I think, because after `ret` the ordinary 5x `nop` would do as well, the longer `nop` opcodes make more sense when aligning loops/etc (i.e. when they get actually executed). Why ICC bothers between functions, where they are just dead code.. not sure. – Ped7g May 22 '18 at 23:05
  • @Soleil: long-NOP (http://felixcloutier.com/x86/NOP.html) takes a ModR/M byte which is decoded but ignored. This allows the assembler to pad large amounts of space with fewer instructions to decode. What matters is the form of the addressing mode, not the specific registers used. (i.e. ModR/M + disp8 vs. CS-prefix + ModR/M + SIB + disp32.) RAX is just a placeholder. It would be `%eax` if an address-size prefix was also used for padding. – Peter Cordes May 22 '18 at 23:07
  • I understand that writing assembly requires more culture (hardware culture), history and cooking knowledge than mathematics. So, in the end, we can absolutely mix 32 bits and 64 bits instructions at will; if it works, if it's more optimal. – Soleil May 22 '18 at 23:22
  • @Soleil: yes, 32-bit is usually at least as fast as 64-bit operand-size, even in 64-bit mode, and gives smaller code-size. [May there be any penalties when using 64/32-bit registers in Long mode?](https://stackoverflow.com/q/40141719) and [The advantages of using 32bit registers/instructions in x86-64](https://stackoverflow.com/q/38303333). Except for stack instructions like `push` or `ret` that default to 64-bit operand size and can't be overridden to 32-bit. (And `loop`, which you can override with an address-size prefix, but it's slow anyway so don't use it.) – Peter Cordes May 22 '18 at 23:28