3

I'm having trouble understanding why the compiler reserved stack space the way it did for the code I wrote.

I was toying with Godbolt's Compiler Explorer to study the C calling convention when I came up with a simple piece of code whose generated assembly puzzled me with its choices.

The code is found in this link. I selected GCC 8.2 x86-64, but I am targeting x86 (32-bit) processors, and this is important. Below is the transcription of the C code and the generated assembly reported by the Compiler Explorer.

// C code
int testing(char a, int b, char c) {
    return 42;
}

int main() {
    int x = testing('0', 0, '7');

    return 0;
}
; Generated assembly
testing(char, int, char):
        push    ebp
        mov     ebp, esp
        sub     esp, 8
        mov     edx, DWORD PTR [ebp+8]
        mov     eax, DWORD PTR [ebp+16]
        mov     BYTE PTR [ebp-4], dl
        mov     BYTE PTR [ebp-8], al
        mov     eax, 42
        leave
        ret
main:
        push    ebp
        mov     ebp, esp
        sub     esp, 16
        push    55
        push    0
        push    48
        call    testing(char, int, char)
        add     esp, 12
        mov     DWORD PTR [ebp-4], eax
        mov     eax, 0
        leave
        ret

Looking at the assembly column from now on: as I understand it, line 15 of the listing (`sub esp, 16` in `main`) is responsible for reserving stack space for the local variables. The problem is that I have only one local `int`, yet the offset is 16 bytes instead of 4. This feels like wasted space.

Is this somehow related to word alignment? But even if it is, since the general-purpose registers are 4 bytes wide, shouldn't the alignment be to 4 bytes?

Another strange thing I see concerns the placement of the local `char`s of the `testing` function. They seem to take 4 bytes each on the stack, as seen in lines 7-8 (the two `BYTE PTR` stores), yet only the low byte of each slot is manipulated. Why not use only 1 byte each?

These choices probably have a purpose, and I would really like to understand it (or learn that there is none). Or maybe I'm just confused and didn't quite get it.

Arthur Araruna
  • Now that I think of it, I believe the `char`s' placement is indeed because of word alignment. Is it? – Arthur Araruna Jan 16 '19 at 05:29
  • Why not build with optimizations on? – Michael Petch Jan 16 '19 at 05:29
  • `Looking ate the assembly column`...dangerous, don't look, run away!! :P – Sourav Ghosh Jan 16 '19 at 05:33
  • Interesting. Even if you enable optimization and change the args to `volatile char a` and so on (https://godbolt.org/z/ul--bv), they're still copied to separate locations on the stack instead of using the arg-passing slot as their permanent location. And they're still kept in different dwords. Also, only the `char` args are copied, unless you `b++` the int. This looks like a gcc missed optimization. The title question is a simple duplicate that's been asked many times (**the i386 System V ABI requires 16-byte stack alignment before a `call`**), but the `char` packing is slightly interesting. – Peter Cordes Jan 16 '19 at 05:47
  • What makes you assume the compiler will not waste space and time if you don't allow it to optimize (`-O0`)? Actually, even if you did allow it to optimize, what makes you assume the compiler would have enough time to compute the perfect solution? (And let's ignore that there may not even be a single "perfect" solution for any medium-or-larger source file.) The amount of time the compiler has is very small (seconds/minutes), so it only scratches the surface of the possible outcomes, and thanks to compiler design even that tiny scratch will often perform excellently well, but "wasting" a bit... – Ped7g Jan 16 '19 at 05:47
  • about char alignment ... it depends whether you optimize for size or for speed; alignment helps speed at the cost of space. So that one is not "wasted", it's intended, as it makes most software under common circumstances "better". – Ped7g Jan 16 '19 at 05:51
  • Just to clarify: I am studying the C calling convention. It wouldn't make sense to allow the compiler to optimize the code, as I wouldn't see the function call. But the question arose because I couldn't understand the reasons why those decisions were made. After some of the bitter and unforgiving comments, I'm now enlightened. Thank you. – Arthur Araruna Jan 16 '19 at 05:57
  • Please put the code into your question. Stack Overflow wants all your questions to be self contained in case external sites go down. – fuz Jan 16 '19 at 09:35
  • I think you get the answer [here](https://stackoverflow.com/questions/45697594/what-is-the-calling-convention-for-the-main-function-in-c). – alinsoar Jan 16 '19 at 09:56

2 Answers

3

So, from the comments, I could figure out that the stack-growth issue is due to the i386 System V ABI's requirement of 16-byte stack alignment before a `call`, as stated by @PeterCordes.
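
To check that part for myself (this experiment is my own addition, not something from the comments): GCC's rounding of the allocation follows its preferred stack boundary, which defaults to 2^4 = 16 bytes. Recompiling the same source with `-m32 -O0 -mpreferred-stack-boundary=2`, so that only 4-byte alignment has to be maintained, should make the `sub esp, 16` in `main` shrink toward the 4 bytes the single local actually needs. A minimal sketch of that experiment:

// Same program as in the question; the compiler flags are the assumption here:
//   gcc -m32 -O0 -mpreferred-stack-boundary=2
// With only 4-byte stack alignment required, the 16-byte rounding of the
// frame allocation in main is expected to go away (a prediction, not
// verified output).
int testing(char a, int b, char c) {
    return 42;
}

int main() {
    int x = testing('0', 0, '7');
    return 0;
}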

The reason why the `char`s are word-aligned may be GCC's default behavior of favoring speed, as can perhaps be inferred from @Ped7g's comment. Although not definitive, this is a good enough answer for me.

Arthur Araruna
  • Putting bytes in separate words looks like a missed optimization to me. GCC doesn't do that for locals, only for args, so probably it's just some left-over gcc internal attribute from them originally being passed as bytes padded to dwords on the stack. It only copies them into the function's own stack frame in the first place because of `-O0`, or if you use `volatile`. Anyway, Byte loads+stores within the same dword or not makes no performance difference when they're both in the same cache line, AFAIK on Intel or AMD CPUs. See https://agner.org/optimize/ – Peter Cordes Jan 16 '19 at 10:05
  • See also [Can modern x86 hardware not store a single byte to memory?](https://stackoverflow.com/q/46721075) (actually x86 can, so can everything else). Also other links in https://stackoverflow.com/tags/x86/info – Peter Cordes Jan 16 '19 at 10:09
0

It's common today to acquire stack space in multiples of a fixed chunk size (16 bytes in this case), for several reasons:

  • cache lines favor this behaviour: keeping the frame within whole cache lines helps keep all of its data in the cache.
  • space for temporaries is preallocated up front, avoiding the need for push and pop instructions whenever some value has to be stored outside the CPU registers (see the sketch below).
  • individual push and pop instructions can degrade pipelined execution, because each one requires the stack pointer to be updated before the next instruction can execute. Preallocating the space removes that data dependency between consecutive instructions and allows them to run faster.

For these reasons, the ABIs that today's compilers follow are designed this way.
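
As a rough illustration of the preallocation point (my own sketch, not part of the original answer, assuming unoptimized output in the same style as the listing in the question): a function with several locals still gets a single stack adjustment at the top of its frame, rather than the stack growing separately for each temporary.

// Hypothetical example: at -O0 one typically sees a single "sub esp, N"
// (with N rounded up for alignment) covering all three locals below,
// instead of a separate push each time a value has to be spilled.
int sum3(int a, int b, int c) {
    int x = a + 1;   // x, y and z each get a slot in the preallocated frame
    int y = b + 2;
    int z = c + 3;
    return x + y + z;
}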

Luis Colorado
  • Intel since Pentium-M and AMD since Bulldozer have a "stack engine" that tracks the updates to ESP/RSP, avoiding a data-dependency chain through the stack pointer from multiple push/pop instructions. (And letting them be single-uop instructions, unlike on earlier P6-family CPUs). This is why gcc used to use `mov` stores/loads for saving/restoring registers, and for writing function args to the stack (if there are any), but these days when tuning for modern CPUs it once again uses push/pop. [What is the stack engine in the Sandybridge microarchitecture?](https://stackoverflow.com/q/36631576) – Peter Cordes Jan 21 '19 at 14:36
  • Maintaining stack alignment before a `call` lets functions allocate aligned storage (e.g. for SSE vectors) cheaply. Compilers typically choose to always aim for 16-byte stack alignment when allocating more stack. But it's actually a missed-optimization that they don't use `push` for *initialized* locals that they were going to spill right away. See my answer on [What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?](https://stackoverflow.com/q/49485395) for an example. But unwind info needs metadata for each ESP/RSP change. – Peter Cordes Jan 21 '19 at 14:40