Causes and benefits of this improvement on gcc version >= 4.9.0 vs gcc version < 4.9?

Question

I have recently exploited a dangerous program and found something interesting about the difference between gcc versions on the x86-64 architecture.

Note:

Wrongful usage of gets is not the issue here.
If we replace gets with any other function, the problem doesn't change.

This is the source code I use:

#include <stdio.h>
int main()
{
    char buf[16];
    gets(buf);
    return 0;
}

I use gcc.godbolt.org to disassemble the program with the flags -m32 -fno-stack-protector -z execstack -g.

In the disassembled code, gcc >= 4.9.0 produces:

lea     ecx, [esp+4]            # begin of main
and     esp, -16
push    DWORD PTR [ecx-4]       # push esp
push    ebp
mov     ebp, esp
/* between these comment is not related to the question
push    ecx
sub     esp, 20
sub     esp, 12
lea     eax, [ebp-24]
push    eax
call    gets
add     esp, 16
mov     eax, 0
*/
mov     ebp, esp            
mov     ecx, DWORD PTR [ebp-4]  # ecx = saved esp
leave
lea     esp, [ecx-4]
ret                             # end of main

But gcc < 4.9.0 produces just:

push    ebp                     # begin of main
mov     ebp, esp
/* between these comment is not related to the question
and     esp, -16
sub     esp, 32
lea     eax, [esp+16]
mov     DWORD PTR [esp], eax
call    gets
mov     eax, 0
*/
leave
ret                             # end of main

My question is: what causes this difference in the disassembled code, and what are its benefits? Is there a name for this technique?

Note: [DO NOT use `gets()`, it is dangerous](http://stackoverflow.com/q/1694036/2173917). Use [`fgets()`](https://linux.die.net/man/3/fgets) instead. — Sourav Ghosh, Dec 23 '16 at 13:48
Different compiler configuration. Use `gcc -v` to compare configurations — LPs, Dec 23 '16 at 13:50
One major difference is that around gcc 4.9.x, they started to support C11, where the `gets` function was finally removed from the language, having been flagged for removal since the year 1999. So you should really consider getting a new source for learning C, since your current source is outdated by well over 17 years. — Lundin, Dec 23 '16 at 13:54
@Lundin: Nit-pick: It has only been outdated for 5 years with regard to the standard, because `gets` was allowed until C11 (from 2011). You are correct with regard to good (and safe) coding style. But then it should never have been used (or made it into anything but the more informal K&R "bible"). — too honest for this site, Dec 23 '16 at 13:58
@Lundin I know, but I just want to know why they disassemble the program so differently. — lzutao, Dec 23 '16 at 14:00
@Olaf C99 future language directions 7.26.9. "`The gets function is obsolescent, and is deprecated.`" Meaning people should have stopped using it in the year 1999. There was an extremely long transition period of 12 years until the next standard was released. — Lundin, Dec 23 '16 at 14:01
@Lundin I think we should read the OP's question more carefully? — mja, Dec 23 '16 at 14:13
I think you should show all assembly code instead of `... # some intructions`. This will probably eliminate some misunderstandings. — Jabberwocky, Dec 23 '16 at 14:16
To all: concerning `gets`: even if `gets` is totally outdated and its use should be punished by the death penalty or at least by formatting the user's hard disk: wrongful usage of `gets` is **not** the issue here. — Jabberwocky, Dec 23 '16 at 14:18
@Lundin the deprecation of `gets()` was first documented in the TC3 (technical corrigendum 3) circa 2007. Before the C11 standard was published, but years later than 1999. — Jonathan Leffler, Dec 23 '16 at 15:39
If you change the function body to initialize `buf` and call `puts()`, do you see the same change in prologue and epilogue? If so, you could avoid the trouble in future by using 'kosher' code even if you spotted the difference with dubious code. — Jonathan Leffler, Dec 23 '16 at 15:48
@JonathanLeffler No, it is still the same as the original in my post. Btw, I don't know what 'kosher' code means. :) — lzutao, Dec 23 '16 at 16:26
Kosher is a term used by Jews for food that has been handled correctly, as laid down in the Talmud and later rabbinical teachings, so that it may be eaten without risk. In this context, it means 'avoiding the stigma that arises from the use of `gets()`' even though (or especially because) it was tangential to the main question. Basically, using `puts()` would have avoided all the controversy over `gets()`. And it could be any other function; `puts()` merely springs to mind as doing something faintly useful. — Jonathan Leffler, Dec 23 '16 at 16:31
@LPs Sorry for the late reply. This is the diff link: https://www.diffchecker.com/kSC8lH8k — lzutao, Dec 23 '16 at 16:32
@JonathanLeffler Thanks for the explanation. As you said, I don't see any change in the problem if I replace `gets` with `puts`. Should I change the question to avoid the problem being misunderstood because of the `gets` function? :) — lzutao, Dec 23 '16 at 16:41
At this stage, it is probably best to leave alone, but keep in mind next time you have a question that you should aim to avoid `gets()` if at all possible. — Jonathan Leffler, Dec 24 '16 at 04:54

Olivier · Answer 1 · 2016-12-23T18:10:33.957

1

I can't say for sure without the actual values in:

and     esp, 0xXX               # XX is a number

but this looks a lot like extra code to align the stack to a larger value than the ABI requires.

Edit: The value is -16, which is 0xFFFFFFF0 in 32 bits or 0xFFFFFFFFFFFFFFF0 in 64 bits, so this is indeed stack alignment to 16 bytes, likely meant for SSE instructions. As mentioned in the comments, the >= 4.9.0 version has more code because it aligns the frame pointer as well, instead of only the stack pointer.

edited Dec 23 '16 at 18:10

answered Dec 23 '16 at 16:40

Olivier


I'm sorry, it was my mistake to replace `-16` with `0xXX`. Very sorry! – lzutao Dec 23 '16 at 16:45
Thank you. But the gcc < **4.9.0** code also has this alignment. – lzutao Dec 23 '16 at 17:10
@lzutao You're right, I had not read the greyed out code. The < 4.9.0 code only aligns the stack pointer. The >= 4.9.0 code also aligns the frame pointer (`ebp`). – Olivier Dec 23 '16 at 17:31
Which ABI do you mean? Linux? Windows? OSX? BSD? Bare-metal? – too honest for this site Dec 23 '16 at 17:34
@Olaf Whichever one he's subject to, as the question does not specify the OS. One of the 32-bit ones, however, as the 64-bit ones are already 16-byte aligned as far as I know. The compiler inserts this to get more alignment than the ABI provides, so it must think the stack is only 4- or 8-byte aligned on entry. – Olivier Dec 23 '16 at 18:07
How do you know the ABI does not have more strict alignment rules without knowing which ABI OP uses? IIRC, Linux-x64 uses 128 byte(!) alignment. – too honest for this site Dec 23 '16 at 20:08
Because the compiler adds the alignment instructions. OP mentions -m32 so this is likely one of the old 32-bit ABIs with poor alignment. I can't be absolutely certain of that but the compiler certainly thinks so or it wouldn't put the instructions there (assuming no compiler bugs). – Olivier Dec 23 '16 at 20:40

score 1 · Answer 2 · answered May 06 '18 at 19:21

The i386 ABI, used for 32-bit programs, only guarantees that, immediately after a process is loaded, its stack is aligned to 32-bit values:

%esp Performing its usual job, the stack pointer holds the address of the bottom of the stack, which is guaranteed to be word aligned.

Compare this with the x86_64 ABI¹, used for 64-bit programs:

%rsp The stack pointer holds the address of the byte with lowest address which is part of the stack. It is guaranteed to be 16-byte aligned at process entry

AMD's new 64-bit technology gave an opportunity to rewrite the old i386 ABI. This allowed a number of optimizations that backward compatibility had ruled out, among them a bigger (stricter) stack alignment.
I won't dwell on the benefits of stack alignment, but suffice it to say that if 4-byte alignment was good, so is 16-byte alignment.
So much so that it is worth spending some instructions to align the stack.

That's what GCC 4.9.0+ does: it aligns the stack to 16 bytes.
That explains the and esp, -16, but not the other instructions.

Aligning the stack with and esp, -16 is the fastest way to do it when the compiler only knows that the stack is 4-byte aligned (so esp MOD 16 can be 0, 4, 8 or 12).
However, it is a destructive method: the compiler loses the original esp value.

But now we have a chicken-and-egg problem. If we save the original esp on the stack before aligning, we lose track of the saved copy, because we don't know how far the alignment lowers the stack pointer. If we try to save it after the alignment, we can't, because the alignment has already destroyed it.
So the only possible solution is to save it in a register, align the stack, and then push that register onto the stack.

;Save the stack pointer in ECX (actually ESP+4, but that works too)
lea     ecx, [esp+4]            #ECX = ESP+4

;Align the stack
and     esp, -16                #This lowers ESP by 0, 4, 8 or 12

;IGNORE THIS FOR NOW
push    DWORD PTR [ecx-4]

;Usual prolog
push    ebp
mov     ebp, esp

;Save the original ESP (before alignment); actually ESP+4, but that's fine
push    ecx

GCC saves esp+4 in ecx. I don't know why², but this value still does the trick.

The only mystery left is the push DWORD PTR [ecx-4].
It turns out to be a simple one. For debugging purposes, GCC pushes a copy of the return address just before the old frame pointer (before push ebp), which is where 32-bit tools expect to find it.
Since ecx = esp_o+4, where esp_o is the original stack pointer before alignment, [ecx-4] = [esp_o] = the return address.

Note that at this point 12 bytes (the copied return address, ebp and ecx) have been pushed below the 16-byte boundary, so esp is 4 modulo 16. Thus the local variable area must be 16*k+4 bytes in size to bring the stack back to 16-byte alignment.
In your example k is 1 and the area is 20 bytes in size.

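The subsequent sub esp, 12, together with the push eax that passes the argument, lowers esp by another 16 bytes. This keeps the stack 16-byte aligned at the call to gets (the requirement is to have the stack aligned at the call instruction).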

Finally, consider the code:

mov     ebp, esp
mov     ecx, DWORD PTR [ebp-4]  # ecx = saved esp
leave
lea     esp, [ecx-4]
ret

The first instruction is a copy-paste error.
One could check this, or simply reason that if it were there, [ebp-4] would be below the stack pointer (and there is no red zone in the i386 ABI).

The rest just undoes what was done in the prolog:

;Get the original stack pointer
mov     ecx, DWORD PTR [ebp-4]          ;ecx = esp_o+4

;Standard epilog
leave                                   ;mov esp, ebp / pop ebp
                                        ;The stack pointer points to the copied return address                

;Restore the original stack pointer
lea     esp, [ecx-4]                    ;esp = esp_o
ret

GCC first has to fetch the original stack pointer (+4) saved on the stack, then restore the old frame pointer (ebp), and finally restore the original stack pointer.
Just before lea esp, [ecx-4] executes, the copied return address is already on top of the stack, so in theory GCC could simply return. It has to restore the original esp anyway, because main is not the first function executed in a C program and it cannot leave its caller's stack unbalanced.

¹ This is not the latest version, but the quoted text is unchanged in subsequent editions.
² This has been discussed here on SO, but I can't remember whether it was in a comment or in an answer.
