
I was trying to understand the different MOV instructions for SSE on Intel x86-64.

According to this, you should use the aligned instructions (MOVAPS, MOVAPD and MOVDQA) when moving data between two registers, picking the one that matches the type you're operating on. And you should use MOVUPS/MOVAPS when moving between a register and memory (and vice versa), since type does not impact performance when moving to/from memory.
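
In other words, the advice boils down to something like the following (my own illustration in NASM syntax, not code from the linked page):

    ; register-to-register: match the move to the data type
    movaps  xmm2, xmm1        ; single-precision floats
    movapd  xmm2, xmm1        ; double-precision floats
    movdqa  xmm2, xmm1        ; packed integers

    ; register <-> memory: per the linked advice the type suffix doesn't
    ; matter here, so the PS forms can be used for any data
    movaps  xmm0, [rsi]       ; load from a 16-byte-aligned address
    movups  [rdi], xmm0       ; store, no alignment requirement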

So is there ever any reason to use MOVDQU and MOVUPD? Is the explanation I got from the link wrong?

Damian Pereira
  • I *think* it might matter for load-use latency on some CPUs, but I haven't tested / don't remember what I read (I may post an answer later). MOVUPD is always useless, because no CPU cares about double vs. single float, but some may have an extra bypass-delay when using the result of a MOVUPS load as an input to an integer vector instruction. If you look at compiler output, some compilers always use MOVU/APS for stores, but still use the matching type for loads. – Peter Cordes Nov 29 '16 at 00:44
    Type doesn't impact performance when moving from/to memory, but if you load a value with `movups` and then perform integer operations on it, there is a penalty. This is why both integer-typed and floating-point-typed move instructions exist. – fuz Nov 29 '16 at 00:50
    So if I load something from memory to xmm1 with movdqu, and then I do a floating point operation with xmm1, there's a penalty? – Damian Pereira Nov 29 '16 at 05:12
  • @DamianPereira Exactly. That's why you should always use type-appropriate `mov` instructions. – fuz Dec 25 '16 at 11:04
  • Note that the link you reference about SSE move performance is rather old and may only apply to older generations of Intel hardware. – Grigory Rechistov Feb 09 '18 at 15:12

2 Answers


Summary: I am not aware of any recent x86 architecture that incurs additional delays when using the "wrong" load instruction (i.e., a load instruction followed by an ALU instruction from the opposite domain).

Here's what Agner has to say about bypass delays, which are the delays you might incur when moving values between the various execution domains within the CPU (sometimes these are unavoidable, but sometimes they may be caused by using the "wrong" version of an instruction, which is what's at issue here):

Data bypass delays on Nehalem

On the Nehalem, the execution units are divided into five "domains":

  • The integer domain handles all operations in general purpose registers.
  • The integer vector (SIMD) domain handles integer operations in vector registers.
  • The FP domain handles floating point operations in XMM and x87 registers.
  • The load domain handles all memory reads.
  • The store domain handles all memory stores.

There is an extra latency of 1 or 2 clock cycles when the output of an operation in one domain is used as input in another domain. These so-called bypass delays are listed in table 8.2.

[Table 8.2: bypass delays between execution domains (not reproduced here)]

*There is still no extra bypass delay for using load and store instructions on the wrong type of data.* For example, it can be convenient to use MOVHPS on integer data for reading or writing the upper half of an XMM register.

The emphasis in the last paragraph is mine, and it is the key part: the bypass delays don't apply to Nehalem load and store instructions. Intuitively, this makes sense: the load and store units are shared by the entire core and have to make their results available in a way suitable for any execution unit (or store them in the PRF), so the forwarding concerns that apply to the ALU case aren't present.

Now, we don't really care about Nehalem any more, but in the sections for Sandy Bridge/Ivy Bridge, Haswell and Skylake you'll find a note that the domains are as discussed for Nehalem, and that there are fewer delays overall. So one could reasonably assume that the behavior where loads and stores don't suffer a delay based on the instruction type remains.

We can also test it. I wrote a benchmark like this:

bypass_movdqa_latency:
    sub     rsp, 120
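    ; (120 bytes of scratch keeps rsp 16-byte aligned for movdqa,
    ; since rsp mod 16 == 8 at function entry)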
    xor     eax, eax
    pxor    xmm1, xmm1
.top:
    movdqa  xmm0, [rsp + rax] ; load under test (~6 cycle latency)
    pand    xmm0, xmm1        ; integer ALU op (1 cycle)
    movq    rax, xmm0         ; xmm -> GP move (2 cycles)
    dec     rdi
    jnz     .top
    add     rsp, 120
    ret

This loads a value using movdqa, does an integer-domain operation (pand) on it, and then moves the result to general purpose register rax so it can be used as part of the address for the movdqa in the next iteration, making the whole chain a loop-carried dependency. I also created 3 other benchmarks, identical to the above except with movdqa replaced by movdqu, movups and movupd.
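
For example, the movdqu variant reads like this (a sketch; the code actually generated in uarch-bench may differ slightly):

bypass_movdqu_latency:
    sub     rsp, 120
    xor     eax, eax
    pxor    xmm1, xmm1
.top:
    movdqu  xmm0, [rsp + rax] ; only the load instruction differs
    pand    xmm0, xmm1
    movq    rax, xmm0
    dec     rdi
    jnz     .top
    add     rsp, 120
    ret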

The results on Skylake-client (i7-6700HQ with recent microcode):

** Running benchmark group Vector unit bypass latency **
                     Benchmark   Cycles
  movdqa [mem] -> pxor latency     9.00
  movdqu [mem] -> pxor latency     9.00
  movups [mem] -> pxor latency     9.00
  movupd [mem] -> pxor latency     9.00

In every case the round-trip latency was the same: 9 cycles, as expected: 6 + 1 + 2 cycles for the load, pand and movq respectively.

All of these tests have been added to uarch-bench in case you would like to run them on any other architecture (I would be interested in the results). I used the command line:

./uarch-bench.sh --test-name=vector/* --timer=libpfc
BeeOnRope
  • `paddd` or other specifically integer instruction might have been a better choice than `pxor`. On Skylake, the integer and FP vector booleans are basically the same, so those execution units are presumably connected to both forwarding networks anyway. (And latency depends on which port the instruction happens to pick, when used between FP instructions.) – Peter Cordes Feb 11 '18 at 02:51
  • Also, SKL bypass latency is funky: it matters how a register value was set even when it's cold in the PRF, and adds latency to the *other* operand. e.g. `addps xmm0, xmm1` has higher latency from xmm0->xmm0 if xmm1 came from integer. IDK why domain-crossing still matters long after any "bypassing" is finished. – Peter Cordes Feb 11 '18 at 02:52
  • I imagine that the whole integer vs. float domain thing within the SIMD units is going to disappear in a couple generations. They all use the same execution units. And I don't see a point to splitting up the register file. If anything, we might start seeing port-based bypass delays. (We already see that with the port5 FMA on Skylake X.) – Mysticial Feb 12 '18 at 20:10
  • @PeterCordes - I think it's much more likely that the "binary op" execution units are simply present in both the FP and integer sides simultaneously - after all, these operations are more or less trivial and are probably just a small part of some other circuit. That seems more feasible than a special "in the middle" unit that somehow avoids the inter-domain forwarding latency. I used `pand` (not `pxor` - the description was wrong) because it conveniently allows me to zero out the value loaded from memory in the benchmark (avoiding the need to suffer store forwarding for part of the loop). – BeeOnRope Feb 12 '18 at 21:56
  • Anyways, I [changed the test](https://github.com/travisdowns/uarch-bench/commit/0c4e467043d16dd955eaf09249a2f189f5ec2467) to use `paddb` now (same result on Skylake). – BeeOnRope Feb 12 '18 at 21:58
  • @Mysticial - I'm not sure what you mean: different operations are certainly using different execution units. The FMA unit is absolutely distinct from the integer shuffle unit. Of course various units are used for multiple instructions, but there are certainly distinct units. There is no splitting of the register file: all vector regs are available to both FP and integer domain operations. In any case, the reason for the domains is, IMO, more related to forwarding than register file use: it is probably too hard to do same-cycle forwarding among all the giant integer and FP execution units... – BeeOnRope Feb 12 '18 at 22:05
  • @BeeOnRope I meant that the Int and FP stuff share a lot of the same units. For example, all SIMD integer multiply goes into the FMA unit on Skylake. All the single-cycle logic as well as shuffling is symmetrical between the two units and there's no penalty to mixing them anymore. So in the past there was an Int vs. FP domain split where there was always a penalty to move data across; now that is fading away and getting replaced with execution-unit and port-based domains regardless of whether the instruction is Int or FP. – Mysticial Feb 12 '18 at 22:59
  • For example, anything that goes into the port5 FMA will normally have an additional latency since it takes an extra cycle to get between the register file and the (far away) port5 FMA. But that latency doesn't apply if the data doesn't need to travel because it's already there. – Mysticial Feb 12 '18 at 23:00
  • @Mysticial - right, integer `mul` is kind of a special case since SIMD `mul` units are expensive and it makes sense to share them, and DP mul can be pretty easily re-used for 32-bit integer mul (that's probably also why you don't get any 64-bit integer mul: it can't run on a 53-bit-mantissa FP multiplier). The simple (e.g., bitwise) ops are probably just duplicated across units. So then there isn't even all that much expensive integer stuff left (but AVX512 brings more) - so maybe the domains become "shuffle" (which is expensive) and FP (FMA is expensive)? – BeeOnRope Feb 12 '18 at 23:05

Note that the link you reference about SSE move performance is rather old and may only apply to older generations of Intel hardware. I've learned that recent microarchitectures have improved the performance of, e.g., unaligned load instructions in cases where they are used on actually aligned data. All in all, a short benchmark is your best source of valid information applicable to the particular piece of hardware you have.
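
As a very rough sketch of such a benchmark (my own illustration, reusing the dependent-load loop from the other answer and timing it with RDTSC; note that RDTSC counts reference cycles rather than core cycles, so pin the CPU frequency or calibrate before trusting absolute numbers):

time_movdqu_latency:              ; rdi = iteration count
    sub     rsp, 24               ; scratch space; keeps rsp 16-byte aligned
    pxor    xmm1, xmm1
    xor     r8d, r8d
    lfence                        ; serialize before reading the TSC
    rdtsc
    shl     rdx, 32
    or      rax, rdx
    mov     r9, rax               ; start timestamp
.top:
    movdqu  xmm0, [rsp + r8]      ; load under test
    pand    xmm0, xmm1            ; zeroes xmm0, keeping the address constant
    movq    r8, xmm0              ; loop-carried dependency through the address
    dec     rdi
    jnz     .top
    lfence
    rdtsc
    shl     rdx, 32
    or      rax, rdx
    sub     rax, r9               ; elapsed reference cycles; divide by the
    add     rsp, 24               ; iteration count for per-iteration latency
    ret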

Grigory Rechistov
  • Yes, it's true that unaligned loads on aligned data now have no extra cost. But the question is whether there's a domain-crossing vec-fp to vec-int penalty (like latency) for loading integer data with `movups` (1 byte shorter than `movdqu`) before using an instruction like `paddd` on it. There might be for loads, but AFAIK there's no penalty for stores. Or at least some compilers choose to use `movups` stores regardless of what instructions created the vector. – Peter Cordes Feb 10 '18 at 02:28