
What is more efficient and why?

Specifically _mm_loadu_si128 vs. _mm_load_si128 in C.

(Editor's note: this question was tagged assembly, so possibly they meant movdqu vs. movdqa in hand-written asm. That is not quite the same question, especially without AVX, because _mm_load_si128 can compile into a memory operand for an ALU instruction, with no separate movdqa at all.)

Peter Cordes
Opal

1 Answer

11

loadu is used for unaligned loads (from addresses that are not aligned to a 16-byte boundary) and load is used for aligned loads. If you know that your source address is correctly aligned then load would normally be more efficient, as it only needs one read cycle and doesn't have to deal with fixing up multiple chunks of misaligned data. On older Intel CPUs the performance penalty for misaligned loads was quite significant (typically > 2x), but on more recent CPUs (e.g. Core i5/i7) the penalty is almost negligible. Note that using loadu on aligned data is OK apart from the aforementioned performance penalty, but using load with misaligned data will generate an exception (i.e. crash).

Paul R
  • Or it could mean `load` vs `load unsigned` (as in MIPS: `LB`/`LH` vs `LBU`/`LHU`, where one pair does sign extension when loading a value from memory into a register while the other does zero extension in the process). – Alexey Frunze Apr 12 '13 at 06:44
  • OP has now clarified - it's x86/SSE – Paul R Apr 12 '13 at 07:22
  • @AlexeyFrunze It doesn't. OP isn't asking for guesswork here. – user207421 Apr 12 '13 at 08:04
  • @EJP My comment was made before the clarification from the OP. – Alexey Frunze Apr 12 '13 at 08:15
  • Can you (or do you know of any source that can) quantify how small the "negligible" penalty is? [This Intel article](https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors) says there was a "degradation of more than 20% when working with misaligned data and using the `loadu` and `storeu` instructions" in some specific case. – Juho Jul 07 '14 at 12:16
  • @mrm: it really depends on the access pattern and how much computation you are performing relative to I/O. I've seen small throughput hits (10% or less) when using misaligned loads on Core i7, and my hunch is that these result from crossing cache line boundaries and other indirect penalties when data is not correctly aligned. Bottom line: use aligned loads/stores wherever possible, but misaligned loads/stores are not a huge problem when you are forced to use them. – Paul R Jul 07 '14 at 13:06
  • `loadu` on aligned data can still cost front-end bandwidth, because the compiler can't fold unaligned loads into memory operands for ALU instructions, and has to use a separate `movdqu`. In practice it's often fine, but more uops take more space in the out-of-order window even if you aren't bottlenecked on front-end throughput. (And it's less hyperthreading-friendly.) – Peter Cordes Feb 15 '19 at 14:16
  • (`loadu` on aligned data is only a problem without AVX; AVX allows memory source operands to be unaligned, so `loadu` can be "folded" into a `vaddps xmm0, [rdi]` or whatever. But still beware of GCC tune settings that split unaligned stores and/or loads: [Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](https://stackoverflow.com/q/52626726) - terrible if the data is *usually* aligned, unless you use `-mtune=haswell` or something (implied by `-march=haswell`)) – Peter Cordes Oct 23 '20 at 02:05
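The folding point from the comments above can be illustrated with a sketch (my example; exact codegen depends on the compiler and target flags). Compiled for SSE2 without AVX, the aligned version can typically fold the load into the ALU instruction, while the unaligned version needs a separate `movdqu`:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Without AVX, a compiler can typically fold the aligned load into the
   ALU instruction:   paddd xmm0, [rdi]          (one fused uop)        */
__m128i add_aligned(const __m128i *p, __m128i v)
{
    return _mm_add_epi32(_mm_load_si128(p), v);
}

/* The unaligned load usually stays separate, costing front-end bandwidth:
                       movdqu xmm1, [rdi]
                       paddd  xmm0, xmm1                                 */
__m128i add_unaligned(const void *p, __m128i v)
{
    return _mm_add_epi32(_mm_loadu_si128((const __m128i *)p), v);
}
```

With AVX enabled, both versions can compile to the same code, since VEX-encoded memory source operands are allowed to be unaligned.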