What is more efficient and why?
Specifically, _mm_loadu_si128 vs. _mm_load_si128 in C.
(Editor's note: or, since this was tagged assembly, possibly they meant movdqu vs. movdqa in hand-written asm. That is not the same thing, especially without AVX, because _mm_load_si128 can compile into a memory operand for an ALU instruction, with no separate movdqa at all.)
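
To make the editor's note concrete, here is a minimal sketch in C. The helper names (add_aligned, add_unaligned, demo) are hypothetical and only for illustration. The asm mentioned in the comments is what an optimizing compiler may emit, not a guarantee: it depends on the compiler, its version, and the flags used.

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_loadu_si128, _mm_add_epi32 */
#include <stdint.h>

/* Hypothetical helper: p MUST be 16-byte aligned, otherwise this may fault.
 * Without AVX, a compiler may fold the load into the add, e.g.
 *     paddd xmm0, [rdi]
 * so no separate movdqa appears in the output. */
__m128i add_aligned(__m128i acc, const int32_t *p)
{
    return _mm_add_epi32(acc, _mm_load_si128((const __m128i *)p));
}

/* Hypothetical helper: p may have any alignment.
 * Without AVX, a legacy-SSE memory operand requires alignment, so the
 * compiler typically needs a separate movdqu before the paddd.
 * With AVX enabled, VEX-encoded memory operands have no such requirement,
 * so this load can usually be folded as well. */
__m128i add_unaligned(__m128i acc, const int32_t *p)
{
    return _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)p));
}

/* Hypothetical demo: adds lane 0 of both results together. */
int32_t demo(void)
{
    _Alignas(16) int32_t buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m128i zero = _mm_setzero_si128();

    __m128i a = add_aligned(zero, buf);        /* lanes: 1, 2, 3, 4 */
    __m128i u = add_unaligned(zero, buf + 1);  /* lanes: 2, 3, 4, 5 (address not 16-byte aligned) */

    return _mm_cvtsi128_si32(a) + _mm_cvtsi128_si32(u);  /* 1 + 2 */
}
```

Compiling both helpers with optimization enabled, once with and once without AVX, and comparing the generated asm is a quick way to see whether the load is folded on a particular toolchain.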