While trying to optimize the misaligned reads needed by my finite-difference code, I changed unaligned loads like this:
__m128 pm1 = _mm_loadu_ps(&H[k-1]);
into this aligned read + shuffle code:
__m128 p0   = _mm_load_ps(&H[k]);
__m128 pm4  = _mm_load_ps(&H[k-4]);
__m128 pm1  = _mm_shuffle_ps(p0, p0, 0x90);   // move 3 floats to higher positions
__m128 tpm1 = _mm_shuffle_ps(pm4, pm4, 0x03); // get missing lowest float
pm1 = _mm_move_ss(pm1, tpm1);                 // pack lowest float with 3 others
where H is 16-byte aligned. There was also a similar change for H[k+1] and H[k±3], and a movlhps/movhlps optimization for H[k±2] (here's the full code of the loop).
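To make the ±2 trick concrete, it combines the high half of one aligned vector with the low half of the next, roughly like this (an illustrative sketch with made-up variable names; the exact sequence is in the linked code):

__m128 t   = _mm_movehl_ps(pm4, pm4); // (H[k-2], H[k-1], H[k-2], H[k-1])
__m128 pm2 = _mm_movelh_ps(t, p0);    // (H[k-2], H[k-1], H[k], H[k+1]) == loadu(&H[k-2])

The k+2 case is symmetric, combining p0 with the next aligned vector H[k+4..k+7].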
I found that on my Core i7-930 the optimization for reading H[k±3] paid off, while adding the next optimization, for ±1, slowed the loop down (by a few percent). Switching between the ±1 and ±3 optimizations didn't change the results.
At the same time, on a Core 2 Duo 6300 and a Core 2 Quad, enabling both optimizations (for ±1 and ±3) boosted performance (by tens of percent), while on a Core i7-4765T both of them slowed the loop down (by a few percent).
On a Pentium 4, all attempts to optimize the misaligned reads, including the movlhps/movhlps ones, led to a slowdown.
Why is it so different across CPUs? Is it because of the increase in code size, so that the loop might no longer fit in some instruction cache? Or is it because some of these CPUs are insensitive to misaligned reads, while others are much more sensitive? Or could it be that operations such as shuffles are slow on some CPUs?
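For reference, here is a minimal standalone check (with made-up data for H; k is a multiple of 4) confirming that the aligned-load-plus-shuffle sequence above reproduces _mm_loadu_ps(&H[k-1]):

#include <stdio.h>
#include <string.h>
#include <stdalign.h>
#include <xmmintrin.h>

int main(void)
{
    alignas(16) float H[8] = { 0, 1, 2, 3, 4, 5, 6, 7 }; // dummy data, 16-byte aligned
    int k = 4;                                           // k is a multiple of 4

    __m128 ref = _mm_loadu_ps(&H[k-1]);                  // unaligned reference load

    __m128 p0   = _mm_load_ps(&H[k]);                    // H[k..k+3]
    __m128 pm4  = _mm_load_ps(&H[k-4]);                  // H[k-4..k-1]
    __m128 pm1  = _mm_shuffle_ps(p0, p0, 0x90);          // lanes 1..3 = H[k], H[k+1], H[k+2]
    __m128 tpm1 = _mm_shuffle_ps(pm4, pm4, 0x03);        // lane 0 = H[k-1]
    pm1 = _mm_move_ss(pm1, tpm1);                        // (H[k-1], H[k], H[k+1], H[k+2])

    float a[4], b[4];
    _mm_storeu_ps(a, ref);
    _mm_storeu_ps(b, pm1);
    printf("%s\n", memcmp(a, b, sizeof a) ? "MISMATCH" : "match");
    return 0;
}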