While trying to optimize the misaligned reads needed by my finite-difference code, I changed unaligned loads like this:
__m128 pm1 = _mm_loadu_ps(&H[k-1]);
into this aligned read + shuffle code:
__m128 p0   = _mm_load_ps(&H[k]);
__m128 pm4  = _mm_load_ps(&H[k-4]);
__m128 pm1  = _mm_shuffle_ps(p0, p0, 0x90);   // move 3 floats to higher positions
__m128 tpm1 = _mm_shuffle_ps(pm4, pm4, 0x03); // get missing lowest float
pm1 = _mm_move_ss(pm1, tpm1);                 // pack lowest float with 3 others
where H is 16-byte aligned. There was also a similar change for H[k+1] and H[k±3], and a movlhps/movhlps optimization for H[k±2] (here's the full code of the loop).
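To make the ±2 trick concrete, it combines the high half of one aligned vector with the low half of the next, roughly like this (an illustrative sketch with made-up variable names; the exact sequence is in the linked code):

__m128 t   = _mm_movehl_ps(pm4, pm4); // (H[k-2], H[k-1], H[k-2], H[k-1])
__m128 pm2 = _mm_movelh_ps(t, p0);    // (H[k-2], H[k-1], H[k], H[k+1]) == loadu(&H[k-2])

The k+2 case is symmetric, combining p0 with the next aligned vector H[k+4..k+7].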
I found that on my Core i7-930 the optimization for reading H[k±3] paid off, while adding the next optimization, for ±1, slowed the loop down (by a few percent). Switching between the ±1 and ±3 optimizations didn't change the results.
At the same time, on a Core 2 Duo 6300 and a Core 2 Quad, enabling both optimizations (for ±1 and ±3) boosted performance (by tens of percent), while on a Core i7-4765T both of them slowed the loop down (by a few percent).
On a Pentium 4, all attempts to optimize the misaligned reads, including the movlhps/movhlps ones, led to a slowdown.
Why is it so different across CPUs? Is it because of the increase in code size, so that the loop might no longer fit in some instruction cache? Or is it because some of these CPUs are insensitive to misaligned reads, while others are much more sensitive? Or could it be that operations such as shuffles are slow on some CPUs?
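For reference, here is a minimal standalone check (with made-up data for H; k is a multiple of 4) confirming that the aligned-load-plus-shuffle sequence above reproduces _mm_loadu_ps(&H[k-1]):

#include <stdio.h>
#include <string.h>
#include <stdalign.h>
#include <xmmintrin.h>

int main(void)
{
    alignas(16) float H[8] = { 0, 1, 2, 3, 4, 5, 6, 7 }; // dummy data, 16-byte aligned
    int k = 4;                                           // k is a multiple of 4

    __m128 ref = _mm_loadu_ps(&H[k-1]);                  // unaligned reference load

    __m128 p0   = _mm_load_ps(&H[k]);                    // H[k..k+3]
    __m128 pm4  = _mm_load_ps(&H[k-4]);                  // H[k-4..k-1]
    __m128 pm1  = _mm_shuffle_ps(p0, p0, 0x90);          // lanes 1..3 = H[k], H[k+1], H[k+2]
    __m128 tpm1 = _mm_shuffle_ps(pm4, pm4, 0x03);        // lane 0 = H[k-1]
    pm1 = _mm_move_ss(pm1, tpm1);                        // (H[k-1], H[k], H[k+1], H[k+2])

    float a[4], b[4];
    _mm_storeu_ps(a, ref);
    _mm_storeu_ps(b, pm1);
    printf("%s\n", memcmp(a, b, sizeof a) ? "MISMATCH" : "match");
    return 0;
}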