Update: alignof(Edge) was 16 because of long double on x86-64 System V, so it's UB to have one at a less-aligned address. This tells GCC it's safe to use movaps. IDK why loading it from (%rbp) didn't also use movaps. I thought that implied Edge wouldn't be 16-byte aligned, so there's a whole section of this answer based on that guess (which I moved to the end).
Some types can require 16-byte alignment, notably long double: alignof(max_align_t) == 16 on x86-64 System V. A drop-in replacement for malloc needs to return memory at least that aligned, for allocations of 16 bytes or larger.
(Smaller allocations of course couldn't hold a 16-byte object and therefore can't require 16-byte alignment. You can ask for a specific instance of an object to be over-aligned, e.g. alignas(16) int foo;, but if a type itself has higher alignment, its sizeof is also a multiple of that alignment, so an array will still obey the normal rules as well as having every element satisfy the alignment requirement.)
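These rules can be checked on your own target with a few assertions. This is only a minimal C11 sketch: struct HasLongDouble and the helper functions are names made up for illustration, and the value 16 is specific to x86-64 System V, so the checks below stick to relations that should hold on any ABI.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas, alignof */
#include <stddef.h>    /* max_align_t, size_t */
#include <stdint.h>    /* uintptr_t */

/* Hypothetical type: it inherits the alignment of its most-aligned member. */
struct HasLongDouble {
    int id;
    long double weight;
};

/* sizeof is a multiple of alignof, so every element of an array stays aligned. */
static_assert(sizeof(struct HasLongDouble) % alignof(struct HasLongDouble) == 0,
              "sizeof must be a multiple of alignof");
static_assert(alignof(struct HasLongDouble) >= alignof(long double),
              "a struct is at least as aligned as each of its members");
static_assert(alignof(max_align_t) >= alignof(long double),
              "max_align_t covers long double");

/* Returns 1 if p is aligned to `align` bytes (align must be a power of 2). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Over-aligns one specific object; the type int itself is unchanged. */
static int overaligned_int_is_16_aligned(void)
{
    alignas(16) int foo = 0;
    return is_aligned(&foo, 16);
}
```

On x86-64 System V you would expect alignof(struct HasLongDouble) to be 16; other ABIs can differ.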
See also Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? where auto-vectorization with a misaligned uint16_t* leads to a segfault. Also see Pascal Cuoq's blog about alignment, about how having objects with less alignment than alignof(T) is undefined behaviour, and about how the assumption of no UB runs deep for compilers.
Instruction selection
GCC and clang use movaps whenever they can prove that memory must be sufficiently aligned (by assuming no UB). On Core2 and earlier, and K10 and earlier, unaligned store instructions are slow even if the memory happens to be aligned at runtime. Nehalem and Bulldozer changed that, but GCC still uses movaps even with -mtune=haswell, or even vmovaps with -march=haswell, even though that can only execute on CPUs with cheap vmovups.
MSVC and ICC never use movaps, hurting perf on very old CPUs but letting you get away with misaligning data sometimes. They will fold aligned loads into memory operands for SSE instructions like paddd xmm0, [rdi] (which requires alignment, unlike the AVX1 equivalent), so they will still make code that faults on misalignment sometimes, but usually only with optimization enabled. IMO that's not particularly great.
alignof(Point) should only be 8 (inheriting the alignment of its most-aligned member, an int64_t). So GCC can only prove 8-byte alignment for an arbitrary Point, not 16.
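As a rough illustration, assume Point is just two int64_t members (the question's real definition may differ). The struct Tagged wrapper is hypothetical too. It shows a perfectly legal Point that sits at an address which is 8-byte but not 16-byte aligned, which is why an aligned 16-byte load through an arbitrary Point* can't be justified by the type alone.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignof */
#include <stddef.h>    /* offsetof */
#include <stdint.h>    /* int64_t */

/* Assumed layout; the real Point in the question may differ. */
struct Point {
    int64_t x;
    int64_t y;
};

/* Hypothetical wrapper: on typical ABIs p lands at offset 8 inside it. */
struct Tagged {
    int64_t tag;
    struct Point p;
};

static_assert(sizeof(struct Point) == 16, "two 8-byte members, no padding");
static_assert(alignof(struct Point) == alignof(int64_t),
              "Point is only as aligned as its most-aligned member");
static_assert(offsetof(struct Tagged, p) == sizeof(int64_t),
              "p directly follows tag, so it need not be 16-byte aligned");

/* The compiler may assume only alignof(struct Point) for *p here. */
struct Point load_point(const struct Point *p)
{
    return *p;
}
```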
For static storage, GCC can know that it chose to align the array by 16 and thus can use movaps / movdqa to load from it. (Also, the x86-64 System V ABI requires that static arrays of 16 bytes or larger be aligned by 16, so GCC can assume this even for an extern unsigned char buffer[] global defined in some other compilation unit.)
You haven't shown a definition for Edge so IDK why it has 16-byte alignment, but possibly alignof(Edge) == 16? Otherwise yes, that might be a compiler bug.
But the fact that it loads the original Edge object from the stack with movups would seem to indicate that alignof(Edge) < 16.
Possibly raw_memory = __builtin_assume_aligned(raw_memory, 8); could help? IDK if that can tell GCC to assume lower alignment than it already thought it could assume based on other factors.
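For reference, here is roughly how the builtin is used. This is only a sketch: copy16_assume8 is a made-up helper, and as noted above it's unclear whether promising a smaller alignment can override what GCC already believes. The builtin returns its first argument, and the promise applies to the returned pointer.

```c
#include <assert.h>  /* assert */
#include <stdint.h>  /* uintptr_t */
#include <string.h>  /* memcpy */

/* Hypothetical helper: copies 16 bytes, promising only 8-byte alignment. */
void copy16_assume8(void *dst, const void *src)
{
    /* Debug-build check that the caller really keeps the promise. */
    assert(((uintptr_t)dst & 7) == 0);
    assert(((uintptr_t)src & 7) == 0);

    /* GNU C builtin: the returned pointers carry the alignment promise. */
    void *d = __builtin_assume_aligned(dst, 8);
    const void *s = __builtin_assume_aligned(src, 8);

    /* memcpy itself has no alignment requirement; the compiler picks the
       load/store instructions based on what it thinks it can assume. */
    memcpy(d, s, 16);
}
```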
You could tell GCC that Edge (or int for that matter) can always be under-aligned by defining a typedef like this:
typedef long __attribute__((aligned(1), may_alias)) unaligned_aliasing_long;
may_alias is actually orthogonal to alignment, but it's worth mentioning because one of the use-cases for this would be loads out of a char[] buffer for parsing a byte stream. In that case you'd want both. That's an alternative to using memcpy(&tmp, src, sizeof(tmp)); to do unaligned strict-aliasing-safe loads.
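A minimal sketch of the two approaches side by side, assuming GNU C for the typedef version. The function names are made up for illustration. Both read a long from an arbitrary byte address.

```c
#include <assert.h>  /* static_assert */
#include <string.h>  /* memcpy */

/* GNU C: a long that may live at any address and may alias other types. */
typedef long __attribute__((aligned(1), may_alias)) unaligned_aliasing_long;

static_assert(sizeof(unaligned_aliasing_long) == sizeof(long),
              "the attributes change alignment and aliasing, not size");

/* Portable ISO C version: memcpy into a properly typed temporary. */
long load_long_memcpy(const void *src)
{
    long tmp;
    memcpy(&tmp, src, sizeof(tmp));
    return tmp;
}

/* GNU C version: dereference through the under-aligned, may_alias typedef. */
long load_long_typedef(const void *src)
{
    return *(const unaligned_aliasing_long *)src;
}
```

With optimization enabled, compilers that understand both forms typically turn each into a single unaligned load on x86-64, so the choice is mostly about portability and readability.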
GCC uses may_alias to define __m128, and may_alias,aligned(1) as part of defining _mm_loadu_ps (the intrinsic for unaligned SIMD loads like movups). (You don't need may_alias for loading a vector of float from a float array, but you do need may_alias for loading it from something else.) See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
And see Why does glibc's strlen need to be so complicated to run quickly? for scalar code that I think is safe for under-aligned / aliasing unsigned long, unlike glibc's fallback C implementation. (Which has to be compiled without -flto so it can't inline into other glibc functions and break because of strict-aliasing violation.)
Allocators and assumed alignment
(This section was written assuming that alignof(Edge) < 16. This was not the case here, but the function attributes might be useful to know about even though they're not the cause of the problem. And probably not a viable workaround either.)
You might be able to use __attribute__ ((assume_aligned (8))) on your allocator to tell GCC about the alignment of the pointer it returns.
GCC may possibly be assuming for some reason that your allocator returns memory usable for any object (and alignof(max_align_t) == 16 on x86-64 System V because of long double and other things, and also on Windows x64).
If this is not the case, you may be able to tell it that. In this mmap mis-alignment Q&A, we can see that GCC does "know about" malloc and treat it specially. But if your function doesn't have an ISO C or C++ defined name, or GNU C attributes, that would be surprising. IDK, it's the best guess so far based on what you've shown, if it's not a compiler bug. (That is possible.)
From the GCC manual:
void* my_alloc1 (size_t) __attribute__((assume_aligned (16)));
void* my_alloc2 (size_t) __attribute__((assume_aligned (32, 8)));
declares that my_alloc1 returns 16-byte aligned pointers and that my_alloc2 returns a pointer whose value modulo 32 is equal to 8.
I don't know why it would assume that a void* returned by a function and cast to another type would have any more alignment than the type of the object being constructed, though. We can see that it uses movups to load an Edge from somewhere. That would seem to indicate that alignof(Edge) < 16.
Also relevant is __attribute__((alloc_size(1))) to tell GCC that the first arg to the function is a size. If your function takes an explicit alignment as an arg, use alloc_align (position) to indicate that, otherwise don't.
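Putting those attributes together might look something like this. It's a sketch under assumptions, not your allocator: my_alloc16 and my_alloc_aligned are hypothetical names, and they're built on C11 aligned_alloc purely so the promises made in the declarations are actually kept (breaking such a promise would itself be UB).

```c
#include <assert.h>  /* assert */
#include <stdint.h>  /* uintptr_t */
#include <stdlib.h>  /* aligned_alloc, free, size_t */

/* Hypothetical allocator: result is 16-byte aligned, arg 1 is the size. */
void *my_alloc16(size_t size)
    __attribute__((assume_aligned(16), alloc_size(1)));

/* Hypothetical variant: arg 1 is the alignment, arg 2 is the size. */
void *my_alloc_aligned(size_t alignment, size_t size)
    __attribute__((alloc_align(1), alloc_size(2)));

void *my_alloc_aligned(size_t alignment, size_t size)
{
    /* C11 aligned_alloc wants size to be a multiple of the alignment,
       so round up (alignment is assumed to be a power of 2). */
    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    return aligned_alloc(alignment, rounded);
}

void *my_alloc16(size_t size)
{
    void *p = my_alloc_aligned(16, size);
    /* Debug-build check that the assume_aligned(16) promise really holds. */
    assert(((uintptr_t)p & 15) == 0);
    return p;
}
```

Memory from either function is released with free(), since both forward to aligned_alloc.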