Agner Fog's instruction tables for Skylake show these two instructions:
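Instruction  Operands   μops fused domain  μops unfused domain  μops each port  Latency  Reciprocal throughput
MOV          r32/64,m   1                  1                    p23             2        0.5
MOVQ         r64,mm/x   1                  1                    p0              2        1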
Here, each instruction has 1 μop in the fused domain and 1 μop in the unfused domain. The "μops each port" column shows p23 for MOV and p0 for MOVQ, which I read as 23 micro-ops per port for MOV and 0 micro-ops per port for MOVQ. Both have a latency of 2, and the reciprocal throughput column shows 0.5 for MOV versus 1 for MOVQ.
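To make the columns explicit, here is a minimal Python sketch of how I am splitting the two rows into fields. The field names are my own labels for Fog's column headings (an assumption, not his notation), and the port column is kept as the raw string from the table rather than interpreted.

```python
from dataclasses import dataclass


@dataclass
class Row:
    instruction: str
    operands: str
    uops_fused: int          # μops, fused domain
    uops_unfused: int        # μops, unfused domain
    uops_each_port: str      # raw table text such as "p23"; not interpreted here
    latency: int
    recip_throughput: float  # reciprocal throughput column


def parse_row(line: str) -> Row:
    # Each row in the excerpt has exactly seven whitespace-separated fields.
    name, operands, fused, unfused, ports, latency, recip = line.split()
    return Row(name, operands, int(fused), int(unfused),
               ports, int(latency), float(recip))


rows = [
    parse_row("MOV r32/64,m 1 1 p23 2 0.5"),
    parse_row("MOVQ r64,mm/x 1 1 p0 2 1"),
]

for row in rows:
    print(row)
```

Laid out this way, the two rows differ only in the port column and the reciprocal throughput column.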
My question is: reading these stats, which of the two instructions is faster? Intuitively, it seems that 23 micro-ops per port for MOV would generate a lot more port pressure than zero micro-ops per port for MOVQ. In his definitions section, Fog says "The number of μops that an instruction generates is important when certain bottlenecks in the pipeline limit the number of μops per clock cycle."
Is my interpretation correct that MOVQ would be faster than MOV? Would it make a difference if the MOV loads from the stack into a register?