x86-64 System V uses x87 for long double, the 80-bit type, and pads it to 16 bytes, with alignof(long double) == 16, so a long double will never cross a cache-line boundary. (Whether that's worth it or not, I don't know; SSE2 was likely one of the motivations for supporting 16-byte alignment cheaply.)
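If you want to check those numbers on your own target, a minimal sketch like this prints them. The function name print_long_double_layout is just something I made up for illustration. On x86-64 System V I'd expect all three values to be 16, as described above, but other ABIs (32-bit x86, Windows x64, AArch64) choose differently, so the only thing asserted at compile time is the relationship that should hold everywhere.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignof */
#include <stddef.h>    /* max_align_t, size_t */
#include <stdio.h>

/* Hypothetical helper: print how this ABI lays out long double,
   next to the alignment of max_align_t. */
void print_long_double_layout(void)
{
    printf("sizeof(long double)  = %zu\n", sizeof(long double));
    printf("alignof(long double) = %zu\n", alignof(long double));
    printf("alignof(max_align_t) = %zu\n", alignof(max_align_t));
}

/* max_align_t must be at least as aligned as any scalar type,
   so this should hold on every conforming implementation. */
static_assert(alignof(max_align_t) >= alignof(long double),
              "max_align_t covers long double");
```

Call print_long_double_layout() from main to see the values for whatever target you compile for.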
But anyway, SSE isn't the only thing contributing to alignof(max_align_t) == 16 (which sets the minimum alignment that malloc is allowed to return). The existence of __m128i doesn't directly contribute to max_align_t at all: for example, 32-bit C implementations support it with lower malloc guarantees. And the existence of __m256i on systems supporting AVX certainly didn't increase the alignment guarantees for allocators. (How to solve the 32-byte-alignment issue for AVX load/store operations?) Still, it's convenient for vectorization, both auto and manual, that malloced memory is aligned enough for movaps, especially on older CPUs from when x86-64 was new, where movups had penalties even when the memory was aligned. It's hard for a compiler to take advantage of that guarantee if it only sees a float*, since you could have passed it a pointer into the middle of an allocation. But if it can see the malloc of an output array, it knows the array will be aligned when auto-vectorizing a loop that writes to that newly malloced space.
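To make the float* vs. visible-malloc distinction concrete, here's a rough sketch. The names is_aligned, scale and scaled_copy are hypothetical, and whether a given compiler actually uses the alignment information in scaled_copy depends on the compiler and its options, so treat this as an illustration of what it is allowed to know, not of what it will emit.

```c
#include <stdalign.h>  /* alignof */
#include <stddef.h>    /* size_t, max_align_t */
#include <stdint.h>    /* uintptr_t */
#include <stdlib.h>    /* malloc */

/* Hypothetical helper: nonzero if p is a multiple of `align` bytes. */
int is_aligned(const void *p, size_t align)
{
    return (uintptr_t)p % align == 0;
}

/* The compiler only sees float* here, so it can assume nothing beyond
   alignof(float): the caller may pass a pointer into the middle of an
   allocation, e.g. scale(buf + 1, src, n - 1, k). */
void scale(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Here the malloc is visible, so the compiler is allowed to assume that
   `out` has malloc's alignment when it vectorizes the stores.
   Returns NULL on allocation failure; the caller frees the result. */
float *scaled_copy(const float *src, size_t n, float k)
{
    float *out = malloc(n * sizeof *out);
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        out[i] = src[i] * k;
    return out;
}
```

For a reasonably large n, is_aligned(out, alignof(max_align_t)) holds for the pointer returned by scaled_copy, while out + 1 is only guaranteed to be float-aligned.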
BTW, ISO C would let malloc for a small allocation (like 1 to 15 bytes) return less-aligned space, since the space could still be used to hold any type that would fit. In C, an object can't require more alignment than its size (e.g. you can't typedef an int that always has to be at the start of a cache line; if you do, the sizeof expands with padding).
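A small sketch of that size/alignment rule, using a hypothetical struct line_int and assuming a 64-byte cache line (64 is my assumption, not something the language gives you). Over-aligning the int member makes the whole type 64-byte aligned, and the size gets padded so that array elements stay aligned.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas, alignof */
#include <stddef.h>    /* size_t */

/* Hypothetical type: an int that must start on a 64-byte boundary.
   64 is an assumed cache-line size. */
struct line_int {
    alignas(64) int value;
};

/* sizeof is always a multiple of alignof, so the struct is padded
   out well past sizeof(int). */
static_assert(alignof(struct line_int) == 64, "over-aligned as requested");
static_assert(sizeof(struct line_int) % alignof(struct line_int) == 0,
              "size is a multiple of alignment");

/* Distance in bytes between consecutive array elements. */
size_t line_int_stride(void)
{
    struct line_int a[2];
    return (size_t)((char *)&a[1] - (char *)&a[0]);
}
```

With GCC and Clang I'd expect sizeof(struct line_int) to be 64, i.e. 60 bytes of padding after the int, which is exactly the "sizeof expands" case.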