
I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps, to relax the alignment constraint and use _mm_loadu_ps instead. There are a lot of myths about the performance implications of memory alignment for SSE instructions, so I made a small test case of what should be a memory-bandwidth-bound loop. Using either the aligned or the unaligned load intrinsic, it runs 100 iterations through a large array, summing the elements with SSE intrinsics. The source code is here: https://gist.github.com/rmcgibbo/7689820
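
The gist has the full program; a minimal sketch of the kind of kernel being timed looks roughly like the following (function and variable names here are illustrative, not the gist's exact code):

#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics: _mm_load_ps, _mm_loadu_ps, ... */

/* Sum n floats (n divisible by 4) using the aligned load intrinsic.
   The unaligned variant is identical except that _mm_loadu_ps replaces
   _mm_load_ps, which lifts the 16-byte alignment requirement on p. */
static float sum_sse(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));  /* p + i must be 16-byte aligned */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                      /* spill the vector and reduce */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}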

The results on a 64-bit MacBook Pro with a Sandy Bridge Core i5 are below. Lower numbers indicate faster performance. As I read the results, I see basically no performance penalty from using _mm_loadu_ps on unaligned memory.

I find this surprising. Is this a fair test / justified conclusion? On what hardware platforms is there a difference?

$ gcc -O3 -msse aligned_vs_unaligned_load.c  && ./a.out  200000000
Array Size: 762.939 MB
Trial 1
_mm_load_ps with aligned memory:    0.175311
_mm_loadu_ps with aligned memory:   0.169709
_mm_loadu_ps with unaligned memory: 0.169904
Trial 2
_mm_load_ps with aligned memory:    0.169025
_mm_loadu_ps with aligned memory:   0.191656
_mm_loadu_ps with unaligned memory: 0.177688
Trial 3
_mm_load_ps with aligned memory:    0.182507
_mm_loadu_ps with aligned memory:   0.175914
_mm_loadu_ps with unaligned memory: 0.173419
Trial 4
_mm_load_ps with aligned memory:    0.181997
_mm_loadu_ps with aligned memory:   0.172688
_mm_loadu_ps with unaligned memory: 0.179133
Trial 5
_mm_load_ps with aligned memory:    0.180817
_mm_loadu_ps with aligned memory:   0.172168
_mm_loadu_ps with unaligned memory: 0.181852
Robert T. McGibbon
  • Are you seriously running high performance code in Python? Python is 2 orders of magnitude slower than C/C++. As for unaligned loads, they are faster than before when using newer hardware (Haswell) due to uArch changes. The big problem with unaligned reads/writes is that they may **cross cache line boundaries** or even page boundaries (less common, of course). – egur Nov 28 '13 at 07:55
  • Using scipy.weave in Python just seemed to be the easiest way to test this snippet of C on different machines and post to Stack Overflow. The actual code that I'm considering changing is a C library that's wrapped in Python. All of the compute happens in the C layer. – Robert T. McGibbon Nov 28 '13 at 08:06
  • And none of this is on Haswell. I'm seeing slight performance advantages of _mm_loadu_ps vs _mm_load_ps on Westmere and Sandy Bridge. – Robert T. McGibbon Nov 28 '13 at 08:10
  • Right now, you're testing MOVUPS on an aligned argument. Take a look at what happens when it's actually being used for an unaligned move. – Nov 28 '13 at 08:18
  • @duskwuff: I tried using unaligned memory as you suggested by offsetting the start of the array, and edited the question. – Robert T. McGibbon Nov 28 '13 at 08:42
  • @RobertMcGibbon But inside ssesum() you wrap the passed-in array with np.asarray(). Are you certain this won't create an aligned array? And are you certain e.g. `a[1:-3]` doesn't create a suitably aligned array, even by chance? I would reckon this test is better done in C, where you control the alignment, so you can test with confidence that one array is not 16-byte aligned and the other is. (Does numpy guarantee 16-byte alignment on the arrays, or are you risking your "aligned" data only being 4-byte aligned?) – nos Nov 28 '13 at 08:59
  • I am sure that the array is not aligned. When I try to use the `aligned=True` command with the `a[1:-3]`, it segfaults the interpreter. And the alignment check, a.ctypes.data % 16 == 0, is done after the asarray command. – Robert T. McGibbon Nov 28 '13 at 09:49
  • I translated it to C anyways. The results are basically a wash -- none of the three versions are consistently faster than the others. – Robert T. McGibbon Nov 28 '13 at 10:25
  • Re "question text is getting rather long", I'd suggest scrapping all the Python stuff for now, and basing the Q just on your C code. – Oliver Charlesworth Nov 28 '13 at 10:36
  • Good idea @OliCharlesworth. I changed the text to focus on the C. – Robert T. McGibbon Nov 28 '13 at 11:02
  • Then I wonder if it's architecture dependent. I get consistently better results with the aligned C code on a Core2 Duo machine. (e.g. the numbers 2.810708, 2.932382, 3.309038) Does it make any difference if you pin the program to one core (using `taskset`) ? – nos Nov 28 '13 at 11:25
  • Note that this is not completely true anymore since Intel Nehalem. Using aligned or unaligned access (`movaps` or `movups`) on an aligned data results in the same performance. – plasmacel Nov 17 '16 at 00:22
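
Regarding the comments about controlling alignment directly in C: a minimal sketch (assuming posix_memalign is available; names are illustrative) of setting up one pointer that is 16-byte aligned and one that deliberately is not:

#include <stdlib.h>

size_t n = 200000000;
float *base;
/* posix_memalign guarantees base is 16-byte aligned (returns non-zero on failure). */
if (posix_memalign((void **)&base, 16, (n + 4) * sizeof(float)) != 0)
    abort();
float *aligned   = base;        /* 16-byte aligned by construction */
float *unaligned = base + 1;    /* 4 bytes past an aligned address, so never 16-byte aligned */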

4 Answers


You have a lot of noise in your results. I re-ran this on a Xeon E3-1230 V2 @ 3.30GHz running Debian 7, doing 12 runs (discarding the first to account for virtual memory noise) over a 200000000-element array, with 10 iterations of the i loop within the benchmark functions, explicit noinline for the functions you provided, and each of your three benchmarks running in isolation: https://gist.github.com/creichen/7690369

This was with gcc 4.7.2.

The noinline ensured that the first benchmark wasn't optimised out.
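
(With GCC this can be done by marking each benchmark function with an attribute, roughly as below; the names are illustrative, the gist above has the real ones.)

__attribute__((noinline))
static float sum_load_ps(const float *p, size_t n);   /* and likewise for the two loadu variants */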

The exact call being

./a.out 200000000 10 12 $n

for $n from 0 to 2.

Here are the results:

load_ps aligned

min:    0.040655
median: 0.040656
max:    0.040658

loadu_ps aligned

min:    0.040653
median: 0.040655
max:    0.040657

loadu_ps unaligned

min:    0.042349
median: 0.042351
max:    0.042352

As you can see, these are some very tight bounds that show that loadu_ps is slower on unaligned access (slowdown of about 5%) but not on aligned access. Clearly on that particular machine loadu_ps pays no penalty on aligned memory access.

Looking at the assembly, the only difference between the load_ps and loadu_ps versions is that the latter uses a movups instruction, re-orders some other instructions to compensate, and uses slightly different register names. The register naming is probably completely irrelevant, and the movups can get optimised out during microcode translation.

Now, it's hard to tell (without being an Intel engineer with access to more detailed information) whether/how the movups instruction gets optimised out, but considering that the CPU silicon would pay little penalty for simply using the aligned data path if the lower bits in the load address are zero and the unaligned data path otherwise, that seems plausible to me.

I tried the same on my Core i7 laptop and got very similar results.

In conclusion, I would say that yes, you do pay a penalty for unaligned memory access, but it is small enough that it can get swamped by other effects. In the runs you reported there seems to be enough noise to allow for the hypothesis that it is slower for you too (note that you should ignore the first run, since your very first trial will pay a price for warming up the page table and caches.)

creichen

There are two questions here: Are unaligned load instructions slower than aligned load instructions when given the same aligned addresses? And are loads from unaligned addresses slower than loads from aligned addresses?

Older Intel CPUs (“older” in this case is just a few years ago) did have slight performance penalties for using unaligned load instructions with aligned addresses, compared to using aligned load instructions. Newer CPUs tend not to have this issue.

Both older and newer Intel CPUs have performance penalties for loading from unaligned addresses, notably when cache lines are crossed.
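
A quick way to see when that happens (a sketch, assuming the usual 64-byte cache line on current x86 parts):

#include <stdbool.h>
#include <stdint.h>

/* True if a 16-byte SSE load starting at p straddles a 64-byte cache line. */
static bool crosses_cache_line(const void *p)
{
    uintptr_t off = (uintptr_t)p & 63;   /* offset of the load within its line */
    return off > 64 - 16;                /* anything past offset 48 spills into the next line */
}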

Since the details vary from processor model to processor model, you would have to check each one individually for details.

Sometimes performance issues can be masked. Simple sequences of instructions used for measurement might not reveal that unaligned-load instructions are keeping the load-store units busier than aligned-load instructions would, so that there would be a performance degradation if certain additional operations were attempted in the former case but not in the latter.

Eric Postpischil
  • Intel CPUs since Nehalem have (almost?) zero penalty for unaligned loads/stores that don't cross a cache-line boundary. I think it's actually zero on Haswell and later, and maybe also on Nehalem and Sandybridge but less sure there. See http://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info. (Of course, unaligned loads without AVX stop you / the compiler from folding loads into memory operands for ALU instructions, which hurts front-end throughput). Store-forwarding may work in more cases with aligned stores. – Peter Cordes Sep 29 '17 at 07:41
  • [How can I accurately benchmark unaligned access speed on x86_64](https://stackoverflow.com/q/45128763) goes into some of the details: uop replay to handle the other cache line, for a cache-line-split load/store on Intel. AMD can have slowdowns even for crossing a 32-byte boundary inside a cache line, IIRC. – Peter Cordes Jan 06 '20 at 12:12

See "§2.4.5.1 Efficient Handling of Alignment Hazards" in the Intel® 64 and IA-32 Architectures Optimization Reference Manual:

The cache and memory subsystems handles a significant percentage of instructions in every workload. Different address alignment scenarios will produce varying performance impact for memory and cache operations. For example, 1-cycle throughput of L1 (see Table 2-25) generally applies to naturally-aligned loads from L1 cache. But using unaligned load instructions (e.g. MOVUPS, MOVUPD, MOVDQU, etc.) to access data from L1 will experience varying amount of delays depending on specific microarchitectures and alignment scenarios.

I couldn't copy the table here; it basically shows that both aligned and unaligned L1 loads take 1 cycle, while a load split across a cache-line boundary takes ~4.5 cycles.
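
As a concrete illustration (this arithmetic is mine, not the manual's): with a float pointer that sits 4 bytes past a 64-byte boundary, consecutive 16-byte loads start at offsets 4, 20, 36 and 52 within each line, so one load in four straddles a line boundary and pays the split penalty, while the other three stay at the ordinary 1-cycle L1 cost.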

Maxim Egorushkin

This is architecture-dependent, and recent generations have improved things significantly. On the older Core2 architecture, on the other hand:

$ gcc -O3 -fno-inline foo2.c -o a; ./a 1000000 
Array Size: 3.815 MB                    
Trial 1
_mm_load_ps with aligned memory:    0.003983
_mm_loadu_ps with aligned memory:   0.003889
_mm_loadu_ps with unaligned memory: 0.008085
Trial 2
_mm_load_ps with aligned memory:    0.002553
_mm_loadu_ps with aligned memory:   0.002567
_mm_loadu_ps with unaligned memory: 0.006444
Trial 3
_mm_load_ps with aligned memory:    0.002557
_mm_loadu_ps with aligned memory:   0.002552
_mm_loadu_ps with unaligned memory: 0.006430
Trial 4
_mm_load_ps with aligned memory:    0.002563
_mm_loadu_ps with aligned memory:   0.002568
_mm_loadu_ps with unaligned memory: 0.006436
Trial 5
_mm_load_ps with aligned memory:    0.002543
_mm_loadu_ps with aligned memory:   0.002565
_mm_loadu_ps with unaligned memory: 0.006400
foorby
  • This test failed to reveal the performance penalty for `movups` even on aligned data, with Core2. Perhaps the compiler saw the data was guaranteed to be aligned and used `movaps` to implement `_mm_loadu_ps`? (I didn't look at the test source). Anyway, `movups` loads have half the throughput, and `movups` stores have 1/4 the throughput, on Core2. – Peter Cordes Sep 29 '17 at 07:49