I think there is a subtle difference between using _mm_loadu_ps and _mm_load_ps even on "Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later" which can have an impact on performance. Operations which fold a load and another operation, such as a multiplication, into one instruction can only be done with the load intrinsics, not the loadu ones, unless you compile with AVX enabled to allow unaligned memory operands.
Consider the following code
#include <x86intrin.h>

__m128 foo(float *x, float *y) {
    // unaligned loads: safe regardless of the pointers' alignment
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    return vx * vy;  // GNU C vector extension; _mm_mul_ps(vx, vy) is the portable spelling
}
Without AVX enabled, this compiles to (x is in rdi and y in rsi under the x86-64 System V calling convention)
movups xmm0, XMMWORD PTR [rdi]
movups xmm1, XMMWORD PTR [rsi]
mulps xmm0, xmm1
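For comparison, here is the same function written with the aligned-load intrinsics. This is a minimal sketch (foo_aligned is just a name for this answer), and it assumes both pointers really are 16-byte aligned; _mm_load_ps faults otherwise.

__m128 foo_aligned(float *x, float *y) {
    // aligned loads: x and y must be 16-byte aligned
    __m128 vx = _mm_load_ps(x);
    __m128 vy = _mm_load_ps(y);
    return vx * vy;
}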
With the aligned load intrinsics (_mm_load_ps), this version compiles to
movaps xmm0, XMMWORD PTR [rdi]
mulps xmm0, XMMWORD PTR [rsi]
which saves one instruction. But if the compiler can use VEX-encoded loads, it's only two instructions for the unaligned case as well.
vmovups xmm0, XMMWORD PTR [rsi]
vmulps xmm0, xmm0, XMMWORD PTR [rdi]
Therefore, for aligned access, although there is no difference in performance between the movaps and movups instructions themselves on Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later, there can still be a difference in performance between the _mm_loadu_ps and _mm_load_ps intrinsics when compiling without AVX enabled: in those cases the compiler's tradeoff is not movaps vs. movups, it's movups vs. folding the load into an ALU instruction. (The folding happens when the vector is used as an input to only one thing; otherwise the compiler will use a mov* load to get the result into a register for reuse.)
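To illustrate that last point, here is a hypothetical sketch (bar is a made-up name, and the exact output depends on your compiler): vy is used twice, so it typically gets its own movaps into a register, while the load of x feeds only the multiply and can be folded into the mulps when compiling without AVX.

__m128 bar(float *x, float *y) {
    // vy is reused below, so the compiler keeps it in a register
    // with a separate movaps load
    __m128 vy = _mm_load_ps(y);
    // this load has a single use, so it can fold into the multiply:
    // mulps xmm0, XMMWORD PTR [rdi]
    __m128 prod = _mm_mul_ps(_mm_load_ps(x), vy);
    return _mm_add_ps(prod, vy);
}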