
Agner Fog's instruction tables for Skylake show these two instructions:

MOV r32/64,m 1 1 p23 2 0.5

MOVQ r64,mm/x 1 1 p0 2 1

where each instruction has 1 micro-op in the fused domain, 1 micro-op in the unfused domain, 23 micro-ops each port for MOV and 0 micro-ops each port for MOVQ, latency of 2 for each, and 0.5 vs 1 in the reciprocal throughput column.

My question is, reading these stats, which of these two instructions is faster? Intuitively it seems that the 23 micro-ops each port for MOV could generate a lot more port pressure than zero micro-ops each port. In his definitions section, Fog says "The number of μops that an instruction generates is important when certain bottlenecks in the pipeline limit the number of μops per clock cycle."

Is my interpretation correct, namely that MOVQ would be faster than MOV? Would it make a difference when the MOV is from the stack to a register?
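For reference, the two alternatives I'm comparing would look something like this (the stack offset and register choices are just illustrative):

```asm
; alternative 1: reload the pointer from its stack slot
mov  rax, [rsp+16]   ; MOV r64, m

; alternative 2: keep the pointer parked in an XMM register
movq rax, xmm0       ; MOVQ r64, xmm
```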

RTC222
  • The two instructions you listed are pretty different; one is a 32 bit store, the other transfers data from a 64 bit general purpose register to an MMX or SSE register. Apples and oranges. – fuz Aug 27 '20 at 19:35
  • Also, p23 means that it can execute on ports 2 and 3. It says nothing about the number of µops it takes. – fuz Aug 27 '20 at 19:36
  • Thanks for the clarification on the meaning of p23. As for your first comment, each of them has the end result in a general purpose register, so I don't understand what you mean that one of them is a store. – RTC222 Aug 27 '20 at 19:40
  • Sorry, I got the operand order wrong; the first is a load from memory, the other an SSE/MMX to general purpose transfer. Quite different instructions. – fuz Aug 27 '20 at 19:42
  • My goal is to move from stack to rax (where stack is a memory location) or from xmm0 to rax. If mov allows use of 2 ports and has a reciprocal throughput of 0.5 (vs 1 for movq) then it seems like mov would be faster, but I may be splitting hairs? – RTC222 Aug 27 '20 at 19:44
  • In principle, but note that these timings do not account for the latency of a memory access, which comes at an extra premium. Try to keep data in registers as much as possible. – fuz Aug 27 '20 at 19:52
  • In this case I'm opting to store the data in the xmm registers because stack is actually memory and, while fast, I imagine movq from an xmm register would be faster, even if only by a small amount. The data are buffer pointers, so to use them I have to move them to a gp register. – RTC222 Aug 27 '20 at 19:54
  • It doesn't answer your question quite directly, but I would suggest writing some C code to do what you want and seeing which operations the compiler chooses to use. Compilers have a vested interest in choosing the fastest operation when given proper motivation (optimization flags and the proper CPU architecture). – Michael Dorgan Aug 27 '20 at 21:28
  • I've never taken the view that compilers are omniscient. They don't always choose the best instructions, so I investigate it myself. But GCC and Clang are very reliable, just not perfect. I do use your approach in a lot of situations, though. – RTC222 Aug 27 '20 at 21:44
  • L1 latency (since the stack is presumably in L1) is about 4 clocks, while your information shows moving from xmm is 2 clocks. If the stack read misses L1, the time goes up substantially, of course. – prl Aug 28 '20 at 00:37
  • Using xmm registers may have significant performance side effects. For example, the OS has to save and restore them if you use them. If the OS does lazy save and restore, then you pay the penalty of a fault the first access after each context switch. This may overwhelm the possible savings. – prl Aug 28 '20 at 00:41
  • @prl - thanks for your comments, particularly the last one. That seems to point to the stack as the better choice. – RTC222 Aug 28 '20 at 00:55
  • @prl: Modern Linux does "eager" FPU save/restore because even scalar integer code uses some SSE instructions for e.g. 16-byte copies or zeroing, or for memcpy / strlen / etc. library functions. I'd assume other modern OSes are similar. – Peter Cordes Aug 28 '20 at 01:13
  • @prl: Also yes, Agner's tables unfortunately do *not* reflect L1d load-use latency. He just splits latency arbitrarily between store and load so the store/reload latencies add up to the length of the store-forwarding round trip latency. Load-use latency to XMM regs is maybe 1 cycle higher than for integer, IIRC. (But if the address is ready early, like normal for RSP, OoO exec can hide it. Using memory is usually cheaper than "spilling" to XMM regs, because of port bottlenecks getting data between integer and XMM, and the front-end savings of memory source operands for ALU instructions). – Peter Cordes Aug 28 '20 at 01:14
  • This is a duplicate other than the mistakes of completely mis-interpreting Agner's notation; the table itself has descriptions of the notation at the top of the section or tab for each uarch. As a sanity check, of course a scalar integer load isn't 23 back-end uops; that conflicts with the "1 unfused domain" from another column. – Peter Cordes Aug 28 '20 at 01:28
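Michael Dorgan's suggestion above can be sketched as a minimal C test case (the function and variable names here are my own, not from the thread). Compiling it with `gcc -O2 -S` and inspecting the assembly shows which instruction the compiler picks for a stack reload; in my experience it emits a plain integer `mov` from the stack slot rather than a `movq` from an XMM register:

```c
#include <stdint.h>

/* Spill a pointer-sized value to the stack and reload it.
   The volatile qualifier forces the compiler to emit a real
   store to memory and a real reload, instead of keeping the
   value in a register the whole time. */
uint64_t spill_reload(uint64_t x) {
    volatile uint64_t slot = x;   /* store to a stack slot */
    return slot;                  /* reload from that slot */
}
```

Compile with `gcc -O2 -S spill.c` and look at the body of `spill_reload` to see the load the compiler chose.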

0 Answers