I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps, relaxing the alignment constraint so that it uses _mm_loadu_ps instead. There are a lot of myths about the performance implications of memory alignment for SSE instructions, so I made a small test case of what should be a memory-bandwidth-bound loop. Using either the aligned or the unaligned load intrinsic, it makes 100 passes over a large array, summing the elements with SSE intrinsics. The source code is here: https://gist.github.com/rmcgibbo/7689820
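To make the comparison concrete, here is a minimal sketch of the kind of summation kernel being timed. It is not copied from the gist: the function names sum_aligned and sum_unaligned are hypothetical, and it assumes the element count is a multiple of 4. The only difference between the two functions is the load intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps, _mm_add_ps */

/* Sum n floats using aligned loads (hypothetical helper).
   `a` must be 16-byte aligned, otherwise _mm_load_ps may fault.
   Assumes n is a multiple of 4. */
static float sum_aligned(const float *a, size_t n)
{
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(a + i));

    /* Horizontal sum of the four accumulator lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Same loop, but with unaligned loads (hypothetical helper).
   `a` may have any alignment valid for float. */
static float sum_unaligned(const float *a, size_t n)
{
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The three timed cases then correspond to calling sum_aligned on a 16-byte-aligned pointer, sum_unaligned on that same pointer, and sum_unaligned on a pointer offset by one float so that it is no longer 16-byte aligned.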
The results on a 64-bit MacBook Pro with a Sandy Bridge Core i5 are below. Lower numbers indicate faster performance. As I read the results, there is basically no performance penalty from using _mm_loadu_ps on unaligned memory.
I find this surprising. Is this a fair test / justified conclusion? On what hardware platforms is there a difference?
$ gcc -O3 -msse aligned_vs_unaligned_load.c && ./a.out 200000000
Array Size: 762.939 MB
Trial 1
_mm_load_ps with aligned memory: 0.175311
_mm_loadu_ps with aligned memory: 0.169709
_mm_loadu_ps with unaligned memory: 0.169904
Trial 2
_mm_load_ps with aligned memory: 0.169025
_mm_loadu_ps with aligned memory: 0.191656
_mm_loadu_ps with unaligned memory: 0.177688
Trial 3
_mm_load_ps with aligned memory: 0.182507
_mm_loadu_ps with aligned memory: 0.175914
_mm_loadu_ps with unaligned memory: 0.173419
Trial 4
_mm_load_ps with aligned memory: 0.181997
_mm_loadu_ps with aligned memory: 0.172688
_mm_loadu_ps with unaligned memory: 0.179133
Trial 5
_mm_load_ps with aligned memory: 0.180817
_mm_loadu_ps with aligned memory: 0.172168
_mm_loadu_ps with unaligned memory: 0.181852