In C terms:
For a simple type like int, aligned load and store functions could look like this:
int load(int *p) { return *p; }
void store(int *p, int val) { *p = val; }
(You'd actually use memcpy to get unaligned and strict-aliasing-safe loads and stores.)
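A minimal sketch of what the memcpy version might look like. The names load_unaligned and store_unaligned are made up for illustration. Compilers normally turn a small fixed-size memcpy like this into a single load or store instruction rather than a library call.

```c
#include <string.h>

/* Hypothetical helper names, not from any library. */

/* Read an int from any address, aligned or not, without violating strict aliasing. */
static inline int load_unaligned(const void *p)
{
    int val;
    memcpy(&val, p, sizeof val);
    return val;
}

/* Write an int to any address, aligned or not. */
static inline void store_unaligned(void *p, int val)
{
    memcpy(p, &val, sizeof val);
}
```

Taking void* means the caller can point these at a char buffer or a struct member without a cast.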
__m128i load/store functions mostly exist to communicate aligned vs. unaligned to the compiler, vs. dereferencing __m128i* directly. And for float / double, they also avoid casts because _mm_loadu_ps takes a const float* arg.
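As a rough illustration of both points, assuming an x86 target with SSE2 and <immintrin.h> available (the function names are hypothetical):

```c
#include <immintrin.h>

/* Hypothetical helpers for illustration only. */

/* Sum 4 floats starting at any address.
   _mm_loadu_ps takes const float*, so no cast is needed. */
static inline float sum4(const float *p)
{
    __m128 v = _mm_loadu_ps(p);     /* no alignment requirement */
    float tmp[4];
    _mm_storeu_ps(tmp, v);          /* unaligned store back to memory */
    return (tmp[0] + tmp[1]) + (tmp[2] + tmp[3]);
}

/* Copy 16 bytes, promising the compiler that both pointers are 16-byte aligned. */
static inline void copy16_aligned(__m128i *dst, const __m128i *src)
{
    _mm_store_si128(dst, _mm_load_si128(src));
}

/* Same copy with no alignment promise. */
static inline void copy16_unaligned(void *dst, const void *src)
{
    _mm_storeu_si128((__m128i *)dst, _mm_loadu_si128((const __m128i *)src));
}
```

The aligned and unaligned copies do the same work. The only difference is what the compiler is allowed to assume about the addresses.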
Later Intel intrinsics take void* args, avoiding the need for a _mm_loadu_si128((const __m128i*)&my_struct) or whatever, but unfortunately they didn't make that improvement until AVX-512 intrinsics.
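For instance, loading a 16-byte struct with the SSE2 intrinsic needs the cast. This is a sketch assuming an x86 target; struct pixel4 and pixel4_equal are hypothetical names:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical 16-byte struct for illustration. */
struct pixel4 { uint32_t rgba[4]; };

/* Returns nonzero if the two structs are bytewise equal. */
static inline int pixel4_equal(const struct pixel4 *a, const struct pixel4 *b)
{
    /* The SSE2 integer load takes const __m128i*, so a cast is required. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);      /* 0xFF in each matching byte */
    return _mm_movemask_epi8(eq) == 0xFFFF;   /* all 16 bytes matched */
}
```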
In asm terms, a load reads data from memory into a register (or as a source operand for an ALU instruction). A store writes data to memory.
C local variables are normally kept in registers, and your compiler is free to optimize intrinsic loads/stores the same way it can optimize dereferences of an int *. For example, it might optimize away a store/reload, so the asm wouldn't contain any instruction for it.
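A small sketch of that, reusing the int load/store from earlier (add_one is a hypothetical name). With optimization enabled, GCC and clang will typically compile this to just an add, with tmp never touching memory:

```c
/* Same load/store as earlier, repeated so this snippet stands alone. */
static inline int load(int *p) { return *p; }
static inline void store(int *p, int val) { *p = val; }

/* Hypothetical function: stores to a local, then reloads it. */
int add_one(int x)
{
    int tmp;
    store(&tmp, x + 1);   /* "store" to a local variable */
    return load(&tmp);    /* "reload" it */
}
```

Looking at the compiler's asm output is the easy way to check what actually happened on your target.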
The fact that there are load and store intrinsics does not mean that __m128i "is a register". It's like int; if/when it can be kept in a register, the compiler will do so, but you can make an array of __m128i or whatever. load/store intrinsics can be optimized away, or a load can be folded into a memory source operand for an ALU instruction like vpaddb.
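Here's a sketch with arrays of __m128i, assuming an x86 target with SSE2 (add_bytes is a hypothetical name). The compiler is free to fold one of the two loads into the add, e.g. paddb xmm0, [mem], or vpaddb with AVX enabled, instead of emitting a separate load instruction:

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical function: dst[i] = a[i] + b[i], bytewise, over arrays of __m128i.
   All three arrays must be 16-byte aligned, which arrays of __m128i are. */
void add_bytes(__m128i *dst, const __m128i *a, const __m128i *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        __m128i va = _mm_load_si128(&a[i]);
        __m128i vb = _mm_load_si128(&b[i]);   /* may become a memory operand of the add */
        _mm_store_si128(&dst[i], _mm_add_epi8(va, vb));
    }
}
```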
Related: