x86_64 stack alignment - purpose of excessive bytes

Question

Recently I was learning about memory alignment and related issues, and it led me to the following program:

```cpp
#include <cstdio>
#include <cstdint>

struct XX
{
    uint8_t a;
    uint32_t b;
    uint16_t d;
    uint64_t c;
};

int main()
{
    printf("\n\ntype size: %zu\n", sizeof(XX));

    XX one;
    XX two;
    XX three;
    printf("addresses of one-three:\n\t%p\n\t%p\n\t%p\n", reinterpret_cast<void *>(&one), reinterpret_cast<void *>(&two), reinterpret_cast<void *>(&three));
    printf("\ndifference of addresses between one and two: %lu\n", reinterpret_cast<unsigned long>(&one) - reinterpret_cast<unsigned long>(&two));
    printf("difference of addresses between two and three: %lu\n", reinterpret_cast<unsigned long>(&two) - reinterpret_cast<unsigned long>(&three));
    printf("alignment of type alone: %zu\n", alignof(XX));

    XX arr[2];
    printf("\ndifference of addresses in array: %lu\n", reinterpret_cast<unsigned long>(&arr[1]) - reinterpret_cast<unsigned long>(&arr[0]));
    printf("alignment of array type: %zu\n", alignof(XX[]));
}
```

I compiled this with GCC 8.1.0 as:

```
g++ -std=c++17 -O0 main.cpp
```

The output says that:

- the alignment is 8 bytes,
- the size is 24 bytes,
- `XX` instances differ by 24 bytes inside the array (contiguous memory, no surprise), but by 32 bytes as free-standing variables.

Why are there 8 extra bytes between the free-standing variables?
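
For reference, the 24-byte size and 8-byte alignment follow from the member layout. The sketch below spells out that arithmetic, assuming the usual x86-64 System V sizes and alignments for the fixed-width types. The helper `round_up` is a name introduced here for illustration, and rounding each object up to a 16-byte boundary is only the hypothesis raised in the comments, not confirmed GCC behaviour.

```cpp
#include <cstddef>
#include <cstdint>

struct XX
{
    std::uint8_t a;   // offset 0, followed by 3 padding bytes
    std::uint32_t b;  // offset 4
    std::uint16_t d;  // offset 8, followed by 6 padding bytes
    std::uint64_t c;  // offset 16
};

// Hypothetical helper (not from the question): round n up to a multiple of
// align, where align is a power of two.
constexpr std::size_t round_up(std::size_t n, std::size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

// Assumption: x86-64 System V sizes and alignments for the fixed-width types.
static_assert(offsetof(XX, b) == 4, "b is aligned to 4 bytes");
static_assert(offsetof(XX, d) == 8, "d follows b directly");
static_assert(offsetof(XX, c) == 16, "c is aligned to 8 bytes");
static_assert(sizeof(XX) == 24, "no tail padding is needed");
static_assert(alignof(XX) == 8, "alignment of the strictest member");

// 24 is not a multiple of 16, so objects that each start on a 16-byte
// boundary end up 32 bytes apart.
static_assert(round_up(sizeof(XX), 16) == 32, "24 rounds up to 32");
```

If every free-standing `XX` were placed on a 16-byte boundary, consecutive objects would be `round_up(sizeof(XX), 16)` bytes apart, which matches the observed 32.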

The x86-64 System V ABI guarantees that automatic-storage arrays >= 16 bytes get 16-byte alignment. I forget if structs/unions get the same thing. Maybe gcc is applying the same thing to structs, even if the ABI doesn't require it. — Peter Cordes, Dec 15 '18 at 21:43
`clang -O3` packs them together with only 24 bytes separating them, so it's probably a gcc implementation detail / missed-optimization. And BTW, you should probably print the signed difference between the pointers (`ptrdiff_t`). `gcc -O3` allocates them in the opposite order, and `-32` is a lot more readable than `18446744073709551584`. — Peter Cordes, Dec 15 '18 at 22:18
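
A minimal sketch of the signed-difference suggestion above. The helper names `signed_distance`, `byte_distance` and `print_distance` are made up for illustration. It uses `std::intptr_t` rather than subtracting the pointers directly, on the assumption that the objects are unrelated, in which case pointer subtraction would be undefined behaviour.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: signed distance in bytes between two integer addresses.
constexpr std::intptr_t signed_distance(std::intptr_t a, std::intptr_t b)
{
    return a - b;
}

// Converts the pointers to integers first, because subtracting pointers to
// unrelated objects directly is undefined behaviour.
inline std::intptr_t byte_distance(const void* a, const void* b)
{
    return signed_distance(reinterpret_cast<std::intptr_t>(a),
                           reinterpret_cast<std::intptr_t>(b));
}

// Prints the distance using the format macro that matches std::intptr_t.
inline void print_distance(const char* label, const void* a, const void* b)
{
    std::printf("%s: %" PRIdPTR "\n", label, byte_distance(a, b));
}
```

In the question's `main`, `print_distance("one - two", &one, &two);` would then print a small negative number instead of a huge unsigned one whenever `one` sits at the lower address.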
*As free-standing variables*, the compiler is not required to place them contiguously or in any particular order. Taking the difference between the addresses of free-standing stack variables doesn't make much sense: it can be any result. However, the alignment must be 8 and the size of an item 24 here. — RbMm, Dec 15 '18 at 22:39
@PeterCordes Yes, I have tried and noticed that, so I intentionally examined the -O0 option thinking of this as most "human understandable". — Marcin, Dec 15 '18 at 23:10
@RbMm Ok, but **why** would the compiler do this instead of just placing the variables one after another? — Marcin, Dec 15 '18 at 23:11
@Marcin: `-O0` means "fast code-gen, consistent debugging, and don't optimize more than necessary". If there are any optimization passes that try to optimize stack layout, `-O0` would omit them, so it's a poor choice. The consistent-debugging part of `-O0` makes it pretty bad for human readability, forcing a store/reload between every C statement. See [Why does clang produce inefficient asm for this simple floating point sum (with -O0)?](https://stackoverflow.com/q/53366394) for why it's so terrible. — Peter Cordes, Dec 15 '18 at 23:31
@RbMm: of course there's no guarantee from ISO C++ they're contiguous. The OP is asking why *in this case* gcc compiling for x86-64 leaves padding. Is there a reason it's necessary, or is it a missed optimization? — Peter Cordes, Dec 15 '18 at 23:32
@PeterCordes On the other hand: why would the compiler "miss" the optimisation? What's not optimal about laying out the struct instances one after another? I think that leaving the 8-byte padding for a purpose (as yet unknown to us) would waste space and time. — Marcin, Dec 15 '18 at 23:55
It might be aligning the start of each object to 16 bytes. It might do that for all large objects, because that's what the ABI requires for arrays. This could potentially be good for copying with SSE or AVX. It's rare not to have any smaller local variables that can fill the gaps. If you expect perfect optimization from your compilers, prepare for disappointment when you actually look at asm. It's often good but rarely optimal. — Peter Cordes, Dec 16 '18 at 00:14
*It might be aligning the start of each object to 16 bytes* - that's the only reasonable explanation for me so far. Personally I don't believe the compiler would "think" of SSE instructions when examining such simple code that has nothing in common with those. I expect a *lack* of optimization with *-O0*, but nevertheless adding an 8-byte padding by the compiler seems somehow odd to me. — Marcin, Dec 16 '18 at 00:53
I didn't mean that there's any code in gcc that checks whether or not SSE for copying / comparing any given object looks useful, as part of an alignment heuristic. I meant that perhaps larger objects are just *always* aligned, and SSE was the motivation for that choice. (e.g. `one = two` can compile to two `movaps` instructions, or `movaps` + `movq`. We'd need `movups` if they had different alignments relative to a 16-byte boundary.) — Peter Cordes, Dec 16 '18 at 11:33
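
To illustrate the point about alignment relative to a 16-byte boundary, here is a small sketch. The helper names and the base address are made up for illustration, and it only checks address arithmetic; it does not show what GCC actually emits.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the integer address is a multiple of align.
constexpr bool address_is_aligned(std::uintptr_t address, std::size_t align)
{
    return address % align == 0;
}

// Same check for a real object, e.g. pointer_is_aligned(&one, 16).
inline bool pointer_is_aligned(const void* p, std::size_t align)
{
    return address_is_aligned(reinterpret_cast<std::uintptr_t>(p), align);
}

// Made-up, 16-byte-aligned base address, used only for illustration.
constexpr std::uintptr_t example_base = 0x7ffc0000;

// Packed 24 bytes apart: the second object is no longer 16-byte aligned,
// so the two objects differ in alignment relative to a 16-byte boundary.
static_assert(address_is_aligned(example_base, 16), "first object is aligned");
static_assert(!address_is_aligned(example_base + 24, 16), "second object is not");

// Spaced 32 bytes apart: both objects keep 16-byte alignment.
static_assert(address_is_aligned(example_base + 32, 16), "second object is aligned");
```

With 24-byte spacing the objects alternate between 16-byte-aligned and misaligned start addresses, which is the case where an aligned 16-byte load or store could not be used for every object.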
Ok, perhaps now I have understood what you mean: 16-byte alignment is applied by GCC in advance to all such structs so that it has the opportunity to use SIMD instructions, because the cost of a few additional bytes on the stack is negligible, whereas it saves the compiler from performing some complicated checks each time. Right? :) — Marcin, Dec 16 '18 at 22:34
semi-related: [Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment?](https://stackoverflow.com/q/63009070). But here, note that the x86-64 ABI mandates 16-byte alignment for local arrays >= 16 bytes. GCC may be giving the same alignment to any large object including structs. And yes, I suggested earlier in comments that GCC may be always aligning large objects so it can always do efficient code-gen for copying them with aligned 16-byte loads/stores. Simpler GCC internals, and maybe some speed gain in exchange for stack space. — Peter Cordes, Jul 25 '20 at 05:35

0 Answers