The first one is the output for ++i
as part of ret = ++i
. It doesn't need to keep the old value around, because it's doing ++i
and then res=i
. Incrementing in memory and then reloading that is a really stupid and inefficient way to compile that, but you compiled with optimization disabled so gcc isn't even trying to make good asm output.
The 2nd one is the output for i++
as part of ret = i++
. It needs to keep the old value of i
around, so it loads into a register and uses lea
to calculate i+1
in a different register. It could have just stored to ret
and then incremented the register before storing back to i
, but I guess with optimizations disabled gcc doesn't notice that.
Previous answer to the previous vague question without source, and with bogus code:
The asm for a tiny expression like b++
totally depends on the surrounding code in the rest of the function (or with optimization disabled, at least the rest of the statement) and whether it's a global or local, and whether it's declared volatile
.
And of course compiler optimization options have a massive impact; with optimization disabled, gcc makes a separate block of asm for every C statement so you can use the GDB jump
command to go to a different source line and have the code still produce the same behaviour you'd expect from the C abstract machine. Obviously this highly constrains code-gen: nothing is kept in registers across statements. This is good for source-level debugging, but sucks to read by hand because of all the noise of store/reload.
For the choice of inc vs. add, see INC instruction vs ADD 1: Does it matter? clang -O3 -mtune=bdver2
uses inc
for memory-destination increments, but with generic tuning or any Intel P6 or Sandybridge-family CPU it uses add $1, (mem)
for better micro-fusion.
See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at and making sense of compiler asm output.
The 2nd version in your original question looks like mostly un-optimized compiler output for this weird source:
// inside some function
int b;
// leaq -4(%rbp), %rax // rax = &b
b++; // incl (%rax)
b = (int)&b; // mov %eax, -4(%rbp)
(The question has since been edited to different code; looks like the original was mis-typed by hand mixing an opcode from once line with an operand from another line. I reproduce it here so all the comments about it being weird still make sense. For the updated code, see the first half of my answer: it depends on surrounding code and having optimization disabled. Using res = b++
needs the old value of b
, not the incremented value, hence different asm.)
If that's not what your source does, then you must have left out some intervening instructions or something. Or else the compiler is re-using that stack slot for something else.
I'm curious what compiler you got that from, because gcc and clang typically don't like to use results they just computed. I'd have expected incl -4(%rbp)
.
Also that doesn't explain mov %eax, -4(%rbp)
. The compiler already used the address in %rax
for inc
, so why would a compiler revert to a 1-byte-longer RBP-relative addressing mode instead of mov %eax, (%rax)
? Referencing fewer different registers that haven't been recently written is a good thing for Intel P6-family CPUs (up to Nehalem), to reduce register-read stalls. (Otherwise irrelevant.)
Using RBP as a frame pointer (and doing increments in memory instead of keeping simple variables in registers) looks like un-optimized code. But it can't be from gcc -O0
, because it computes the address before the increment, and those have to be from two separate C statements.
b++ = &b;
isn't valid because b++
isn't an lvalue. Well actually the comma operator lets you do b++, b = &b;
in one statement, but gcc -O0
still evaluates it in order, rather than computing the address early.
Of course with optimization enabled, b
would have to be volatile
to explain incrementing in memory right before overwriting it.
clang is similar, but actually does compute that address early. For b++;
b = &b;
, notice that clang6.0 -O0
does an LEA and keeps RAX around across the increment. I guess clang's code-gen doesn't support consistent debugging with GDB's jump
the way gcc does.
leaq -4(%rbp), %rax
movl -4(%rbp), %ecx
addl $1, %ecx
movl %ecx, -4(%rbp)
movl %eax, %ecx # copy the LEA result
movl %ecx, -4(%rbp)
I wasn't able to get gcc or clang to emit the sequence of instructions you show in the question with unoptimized or optimized + volatile
, on the Godbolt compiler explorer. I didn't try ICC or MSVC, though. (Although unless that's disassembly, it can't be MSVC because it doesn't have an option to emit AT&T syntax.)