
I would like to know more about the _mm_lddqu_si128 intrinsic (the lddqu instruction, available since SSE3), particularly compared with the _mm_loadu_si128 intrinsic (the movdqu instruction, available since SSE2).

I only discovered _mm_lddqu_si128 today. The Intel Intrinsics Guide says

this intrinsic may perform better than _mm_loadu_si128 when the data crosses a cache line boundary

and a comment says it

will perform better under certain circumstances, but never perform worse.

So why is it not used more (SSE3 is a pretty low bar, since all Core2 processors have it)? Why may it perform better when data crosses a cache line? Is lddqu possibly better only on a certain subset of processors, e.g. before Nehalem?

I realize I could read through an Intel manual to probably find the answer but I think this question may be interesting to other people.

Z boson
  • I tried both, and saw no performance difference (tried it on Core i7 2600) – Rotem Jul 14 '16 at 10:35
  • @Rotem it's possible it's only better before Nehalem (i.e. on systems with SSE3 and SSSE3 but not SSE4.1) but that's just a guess. – Z boson Jul 14 '16 at 11:23
  • *"I realize I could read through an Intel manual to probably find the answer but I think this question may be interesting to other people."* I upvoted this question, but please let's not get a theme going here... – Cody Gray - on strike Jul 14 '16 at 16:57
  • @harold: nope, Prescott P4 only, not Merom. – Peter Cordes Jul 14 '16 at 20:37
  • Not *exactly* a duplicate, but my answer on the newer question covers this and more. They should be linked together somehow, and I think duplicate works. (I could edit them both with links to the other.) It's funny that I ended up writing mostly the same answer about 1.5 years later about the AVX 256b version. – Peter Cordes Dec 03 '17 at 07:27

1 Answer


lddqu used a different strategy than movdqu on P4, but it runs identically on every other CPU that supports it. There's no particular downside (SSE3 instructions don't take any extra bytes of machine code, and they're fairly widely supported, even by AMD at this point), but there's no upside at all unless you care about P4.

Dark Shikari (one of the x264 video encoder lead developers, responsible for a lot of SSE speedups) went into detail about it in a blog post in 2008. This is an archive.org link since the original is offline, but there's a lot of good stuff in his blog.

The most interesting point he makes is that Core2 still has slow unaligned loads, and that manually doing two aligned loads plus a palignr can be faster; the catch is that palignr only accepts an immediate (compile-time constant) shift count. Since Core2 runs lddqu the same as movdqu, lddqu doesn't help there.

Apparently Core1 does implement lddqu specially, so it's not just P4 after all.


This Intel blog post about the history of lddqu/movdqu (which I found in 2 seconds by googling lddqu vs movdqu; /scold @Zboson) explains:

(on P4 only): The instruction works by loading a 32-byte block aligned on a 16-byte boundary, extracting the 16 bytes corresponding to the unaligned access.

Because the instruction loads more bytes than requested, some usage restrictions apply. Lddqu should be avoided on Uncached (UC) and Write-Combining (USWC) memory regions. Also, by its implementation, lddqu should be avoided in situations where store-load forwarding is expected.

So I guess this explains why they didn't just use that strategy to implement movdqu all the time.

I guess the decoders don't have the memory-type information available, and that's the point at which the decision has to be made about which uops to decode the instruction to. So trying to be "smart" about opportunistically using the better strategy on WB memory probably wasn't possible, even if it were desirable (which it isn't, because of store-forwarding).


The summary from that blog post:

starting from the Intel Core 2 brand (Core microarchitecture, from mid-2006, Merom CPU and higher) up to the future: lddqu does the same thing as movdqu

In other words:
* if the CPU supports Supplemental Streaming SIMD Extensions 3 (SSSE3) -> lddqu does the same thing as movdqu,
* if the CPU doesn't support SSSE3 but supports SSE3 -> go for lddqu (and note that story about memory types)

Peter Cordes
  • How many people still use a x86 computer without SSSE3? If I look at the Steam stats (for whatever they are worth) almost 100% of users have SSE3 but only 90% have SSSE3? Are there really people still using the P4 (or even earlier)? I think you mentioned in a post you still use a P4. – Z boson Jul 18 '16 at 06:40
  • @Zboson: Most of this was meant as historical-interest, because it's basically not still relevant. The 10% of users without SSSE3 are probably mostly AMD. I've never owned a P4. I had a P-MMX, then some AMD Athlons, then K8, then Core2 (Merom), then SnB. Intel's BIOS update bricked my SnB motherboard, so I'm using my Core2Duo E6600 again until I stop being lazy and decide what Skylake mobo I want to buy. I did see a P4 still in use in a desktop last summer I think it was, in an apartment my brother was sharing with a guy who wasn't a computer geek at all (and had a console for games). – Peter Cordes Jul 18 '16 at 06:48
  • AMD explains it, thanks. I did not realize that AMD skipped from SSE3 (before bulldozer) directly to AVX (bulldozer). AMD had SSE4a before it had SSSE3. I'm surprised a BIOS updated bricked your SnB motherboard. I had a 68k CPU then decided to study physics instead of comp sci and the next computer I bought where I actually knew what the processor was was a 2600k SnB :-) – Z boson Jul 18 '16 at 06:54
  • @Zboson: Yeah, I was surprised too, and pretty disappointed that Intel left [a dangerous BIOS update for their DZ68DB motherboards](https://communities.intel.com/thread/30773) up on the download page with no warning for all these years. I updated without doing any further reading while trying things to see if they'd make the Linux graphics drivers work better. I think years ago I had seen those warnings and not updated, but I'd forgotten. Heh, I've always been a CPU geek. Even before I knew much asm, I'd read news articles about how they worked, and which was faster. Big surprise there :P – Peter Cordes Jul 18 '16 at 07:02
  • If the 10% of users without SSSE3 are AMD I assume they have SSE4a (Barcelona microarch). I would guess that `lddqu` is implemented the same as `movdqu`. In which case `lddqu` is for all practical purposes obsolete. Now I understand why SSE3 is becoming the base and not SSSE3. Probably 99% of Intel users have SSSE3 but many AMD users do not. – Z boson Jul 18 '16 at 07:08
  • It seems strange to me that AMD implemented SSE4a before SSSE3. I wonder why they did this. That's annoying. – Z boson Jul 18 '16 at 07:12
  • @Zboson: [Intel's anti-competitive practices of not disclosing details with enough lead-time for AMD to implement them](http://www.agner.org/optimize/blog/read.php?i=25) is probably the major reason. Other x86 shenanigans like FMA4 vs FMA3 are almost entirely Intel's fault, so I assume this is, too. – Peter Cordes Jul 18 '16 at 07:15
  • Will Intel hurts itself a bit by doing this because it means that SSE3 becomes the baseline instead of SSSE3. I guess this is why anti-competitive practices are suppose to be discouraged by the government. This is an example of the [Prisoner's dilemma](https://en.wikipedia.org/wiki/Prisoner%27s_dilemma) though some [disagree](http://www.siliconinvestor.com/readreplies.aspx?msgid=23176784). – Z boson Jul 18 '16 at 07:35
  • @Zboson: Having the baseline be only SSE3 instead of SSSE3 doesn't particularly hurt Intel more than AMD. It hurts Intel's customers, but we're still going to buy their CPUs. If anything, crap like this does benefit Intel because developers are more likely to spend time implementing a speedup for Intel CPUs only than for AMD CPUs only. e.g. even AMD has given up on XOP for future CPUs. (I wish Intel had adopted it; the `vpperm` 2-source byte shuffle fills so many gaps...) And then some software will be even faster on Intel than AMD, widening the perf gap. – Peter Cordes Jul 18 '16 at 08:10
  • @Zboson: I just realized that I'd made the implicit assumption that hurting the x86 architecture with crap like this doesn't hurt Intel. It is actually possible that ARM or something else will displace x86 more easily, but in the long run the main impact is less compact machine code, and maybe limitations of future extensibility (although the VEX coding space still has tons of room; currently only 1/64th use, IIRC.) So I doubt that this will have much impact on the long-term competitiveness of x86. The people it creates more work for (developers) aren't the ones making HW purchase decisions – Peter Cordes Jul 18 '16 at 08:25
  • Btw, there's a VEX-encoded `vlddqu` that also works on 32-byte widths. Go figure. I can't tell if that's intentional, or just the result of some stupid hard-wiring that equally promoted all legacy SSE to VEX without regard to usefulness. Same applies to `vpalignr/vpsrldq/vpslldq`. Perhaps it simplifies their k-map reduction for the instruction decoder. – Mysticial Aug 03 '16 at 19:50
  • @Mysticial: even worse, there's an `_mm256_lddqu_si256` intrinsic and [people actually use it](http://stackoverflow.com/a/38575836/224132) /facepalm. IDK why they bothered to build the shuffle hardware for `vpalignr ymm`, though; agreed on its lack of use-cases with those stupid semantics. There are some instructions that don't have 256bit forms though, only SSE and AVX-128: `movhps`, `pextr*` / `pinsr*`, `PCMPISTRM` (and the other SSE4.2 string insns), `MASKMOVDQU` (not the larger-granularity `VPMASKMOVD/Q`). Also AES-NI, and `PCLMULQDQ`. Decode transistors is still a possibility. – Peter Cordes Aug 03 '16 at 20:06
  • So `lddqu` finally dies in AVX512. `vpalignr` turns into `valignr` and does the right thing. But `vpsrldq/vpslldq` still do in-lane shifts. I am curious what the use-cases are for the in-lane 128-bit shift. At this stage, it's no longer "free" to keep it around. It's not a subset of any other instruction until Cannonlake's byte-permute. (Though I suppose it can be done cheaply if you generalized the `valign` hardware just a little bit.) – Mysticial Aug 03 '16 at 20:22