3

The GNU documentation states that malloc is aligned to 16 byte multiples on 64 bit systems. Why is this?

If my understanding is correct, registers and all instructions operate on values that are a maximum of 8 bytes wide. Thus, it would seem that 8 byte alignment would be the requirement.

Notes:

chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • 2
    Perhaps because the exact same reason the stack is 16-byte aligned? And note that some types in C might be larger than 8 bytes (for example `long double` *could* be longer, and then there's the SSE types, even though they are extensions of the language). – Some programmer dude Jan 13 '22 at 07:31
  • Interesting. It seems to me they could trivially align it to the largest type at compile time if that was the reason. Alignment for the SSE types seem like the most compelling reason I've come across. But, no one has said this definitively. – Moss Richardson Jan 13 '22 at 08:08
  • What do you mean with "largest type at compile time"? Do you suggest the compiler checks the largest type that is used by the program and adjust accordingly? The library containing `malloc` is compiled at another time than your program and each compilation unit is compiled separately. The compiler does not know anything about code in other C files or libraries. – Gerhardh Jan 13 '22 at 08:48
  • At compile time for glibc. The compiler obviously knows the target architecture and word size. Certain types are guaranteed to default to the word size (at least when using gcc). When malloc is compiled, they could detect the word size and set it accordingly. Obviously, they're choosing something greater than the word size which was my point. – Moss Richardson Jan 13 '22 at 10:16

2 Answers2

7

x86_64 uses xmm registers (heavily -- all fp stuff is done in xmm registers as the 8087 fp registers are deprecated), and xmm registers require 16-byte alignment for (efficient) access.

So most things in x86_64 (both the stack and heap allocated by malloc) are organized to always be 16-byte aligned, so the compiler is always free to use the 'aligned' instructions when xmm registers are involved and does not need to use the (possibly slower) unaligned instructions.

On newer hardware, the compiler does not even need to go to the trouble of using the aligned instructions -- the unaligned instructions are as fast as the aligned instruction when the memory is aligned.

Chris Dodd
  • 119,907
  • 13
  • 134
  • 226
  • There is no scenario in which regular floating point stuff would do a 16-byte load/store to malloc'ed memory though. Regular FP stuff can trigger 16-byte spills and reloads, but that's to stack memory. The only way to get a 16-byte store to malloc'ed memory is to write it yourself, and you can also do that in 32-bit code. – harold Jan 13 '22 at 10:20
  • @harold: if you enable optimization on most compilers (gcc and clang at least), they'll vectorize things and turn accesses of all kinds of memory (not just stack) into wider accesses. Unfortunately, with gcc you still need to tell it explicitly that things are aligned to use the aligned accesses -- it does not infer it automatically from the use of `malloc`. But on newer hardware that turns out to not matter too much -- the unaligned ops are as fast as the aligned ops when the memory is aligned. – Chris Dodd May 21 '22 at 20:09
3

x86-64 System V uses x87 for long double, the 80-bit type. And pads it to 16-byte, with alignof(long double) == 16 so a long double will never cross a cache-line boundary. (Worth it or not, IDK; likely SSE2 was one of the motivations for supporting 16-byte alignment cheaply).

But anyway, SSE stuff isn't the only thing contributing to alignof(max_align_t) == 16 (which sets the minimum alignment that malloc is allowed to return).

The existence of__m128i doesn't directly contribute to max_align_t at all, for example 32-bit C implementations support it with lower malloc guarantees. Certainly the existence of __m256i on systems supporting AVX didn't increase the alignment guarantees for allocators. (How to solve the 32-byte-alignment issue for AVX load/store operations?). But certainly it's convenient for vectorization, both auto and manual, that malloced memory is aligned enough for movaps, especially on older CPUs when x86-64 was new and movups had penalties even when the memory was aligned. It's hard for a compiler to take advantage of that guarantee if it only sees a float*, you could have passed it a pointer into the middle of an allocation. But if it can see the malloc of an output array, it knows it will be aligned if auto-vectorizing a loop that writes to that newly malloced space.

BTW, ISO C would let malloc for a small allocation (like 1 to 15 bytes) return less-aligned space, since the space could still be used to hold any type that would fit. In C, an object can't require more alignment than its size. (e.g. you can't typedef an int that always has to be at the start of a cache line, or if you do the sizeof expands with padding.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847