Alignment, Malloc Behavior, and Stack allocations
@Peter Cordes's comment addresses part of the problem.
Specifically on your OSX machine you are "always" getting correctly aligned memory which meets the alignment needs for the assembly generated for the __m256i
datatype, which is supposed to be 32 byte aligned. (I put "always" in quotations because it isn't guaranteed with malloc
. You may have just gotten lucky with OSX's malloc
. Multiple runs of the same code tend to get the same alignments, unlike repeated calls to malloc
inside one program. Actually, see below: the compiler on OS X generates different asm)
On Ubuntu, you are not getting a suitable alignment in the memory address returned by malloc
. (See below for details on why.)
You call your second code snippet a slight change
.
int main()
{
struct Box result[1];
(*result).L=(*result).L;
}
It is actually quite substantially different from the first snippet which uses malloc
because the the complier (gcc here) is aware of the alignment requirements of the data type Box
(and by extension __m256i
also) when allocating the memory on the stack. Thus there is no risk of segfaulting in this case because the complier provides the correct alignment.
You can manipulate the base pointer returned by malloc
as explained in this post https://stackoverflow.com/a/227900/3516034. I'll let you look there for the details but in short you can do something like
struct Box *result=NULL;
void *mem = malloc(2 * sizeof(struct Box));
result = (struct Box *)((uintptr_t)mem + offset);
where offset
allows you to explore the alignment and segfaults. It may be useful for you to print out the pointer address you end up using for result
like printf("0x%08" PRIXPTR "\n", (uintptr_t)result);
(again from that post).
Instruction Differences
Lastly, I can reproduce this on Ubuntu and OSX. I actually see my OSX malloc
calls giving 16 byte alignment and not 32 byte alignment. I also see 16 byte alignment on Ubuntu (in a VM on the same hardware) which is causing the segfault. When I manually align to 32 bytes, the segfault goes away.
So the root cause of your problem is that the systems are not generating the same assembly instructions. I got the assembly with the -S
option to gcc. On OSX, I see vmovaps
used 4 times with XMM operands, and on Ubuntu vmovdqa
twice with YMM operands, to move a __m256i
.
vmovdqa
and vmovaps
require their memory operands to be naturally-aligned (i.e. 32B for YMM, 16B for XMM). So the assembly generated on OSX only requires 16B alignment even though __m256i
is 32B.