
Suppose:

  • A general-purpose register (GPR), say r8, holds the value 3.14.

  • r9 holds the address of a memory location containing 2.71.

Which one is faster?

This:

movq xmm0, r8    ; read 3.14 from r8 into xmm0
movq r8, xmm0    ; write 3.14 from xmm0 back to r8

Or this:

movsd xmm1, [r9]   ; read 2.71 from memory into xmm1
movsd [r9], xmm1   ; write 2.71 from xmm1 back to memory

By “faster” I mean read/write access time.

Peter Cordes
Citra Dewi
  • You can check these sorts of things using microarchitectural tables, e.g. https://uops.info. Note however that in practice, the load/store can be faster than moving from/to a GPR as it doesn't compete for p015 but instead runs on load/store ports. – fuz Jul 16 '22 at 18:01
  • Also, please define “faster.” Do you mean “which has the shorter latency?” Or do you mean in terms of throughput? Also, what microarchitecture are you programming for? And is the move part of the critical path? – fuz Jul 16 '22 at 18:02
  • By “faster” I mean a shorter time in nanoseconds between two points: from when the instruction is fetched until it finishes executing, in CPU cycles. What is that called? Latency? Throughput? – Citra Dewi Jul 16 '22 at 18:13
  • Modern processors are out of order designs and do many things at once, so the *latency* (number of cycles between instruction execution and result being ready) is not the only important factor to ascertain performance. Throughput means “how many of these instructions can we execute per cycle?” Now as for “fetching instruction until executed (resp. retired),” that's a very high number and can be up to a few hundred cycles regardless of instruction when the CPU is really busy. It's also not really relevant for performance as many hundred instructions can be in flight at the same time. – fuz Jul 16 '22 at 18:21
  • You might want to read up on the performance characteristics of out of order processors before revisiting this question. Otherwise a detailed answer cannot really be given. Usually though it's better to do a GPR→XMM/XMM→GPR move. However, it is worse to first load into a GPR and then move the GPR into an XMM register than to load directly into an XMM register. – fuz Jul 16 '22 at 18:23
  • https://uops.info/ has latency measurements for those round-trip tests. (And throughput measurements for which throughput resources they consume; store-port uops vs. ALU port.) See also [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) / [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) – Peter Cordes Jul 16 '22 at 18:25
  • I don't really know what “R/W access time” is supposed to mean. – fuz Jul 16 '22 at 18:26
  • If you want the value in both places, usually better to do 2 independent loads. Generally avoid XMM<->GPR transfers if you can; they have very limited throughput. (Like 1c even on Alder Lake which can do 3 loads and 2 stores per clock). – Peter Cordes Jul 16 '22 at 18:27
  • @PeterCordes I wrote “it is not better” originally and then copy-edited it into “it is worse.” Made a mistake there. It is now fixed. – fuz Jul 16 '22 at 18:31
  • @fuz: Tempted to close this as a duplicate of [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899). It would be potentially useful to have a Q&A that discusses the costs of XMM<->GPR transfers if we don't have one already, but this question isn't clear enough to answer, I don't think. Not clear if it's asking about round trip latency for doing those back to back, or each instruction separately, or what surrounding code and use-case we're talking about. It needs framing in terms that make sense (latency vs. front-end vs. back-end uops). – Peter Cordes Jul 16 '22 at 18:35
  • The terms read/write are perhaps most familiar from digital circuits. I watched [CrashCourse](https://youtu.be/FZGugFqdr60). From what I understand, a read moves data from memory to a register and a write moves data from a register to memory. So read access time means the time needed to move data from memory to a register, and likewise for write access time. – Citra Dewi Jul 16 '22 at 18:36
  • @PeterCordes Let's leave it open. There's a good chance OP might decide on improving his question so it can be answered. – fuz Jul 16 '22 at 18:36
  • @fuz: Temporary closure pending improvement is a thing we can do. Mainly holding off because there is arguably some answerable stuff. – Peter Cordes Jul 16 '22 at 18:39
  • @CitraDewi Okay, this is commonly called “latency” in out of order processors. Note that it's a nuanced thing: the latency differs a lot depending on where you want to pick up the result or where it comes from (e.g. renamer, register file, L1 cache, L2 cache, L3 cache, RAM on same socket, RAM on other socket, RDMA, swap, network, tape library). Note also that latency on its own is not the main deciding factor for the performance of an instruction. – fuz Jul 16 '22 at 18:39
  • @CitraDewi: Store-data latency is not very observable in out-of-order exec CPUs. Only as part of the round-trip latency for reloading soon after. The [store buffer](https://stackoverflow.com/questions/64141366/can-a-speculatively-executed-cpu-branch-contain-opcodes-that-access-ram) decouples execution of the store uop from commit to L1d cache (at some point after retirement). And L1d is write-back, so it doesn't propagate to shared L3 until after that. (Inter-core latency for another thread reading it is more about the interconnect between cores.) – Peter Cordes Jul 16 '22 at 18:42
  • Related: [Assembly - How to score a CPU instruction by latency and throughput](https://stackoverflow.com/q/52260625). And required reading: [**Modern Microprocessors A 90-Minute Guide!**](https://www.lighterra.com/papers/modernmicroprocessors/) covers pipelined and out-of-order exec. – Peter Cordes Jul 16 '22 at 18:43
  • Agner Fog's microarch guide (https://agner.org/optimize/) is also relevant: apparently AMD's optimization manual for the obsolete Bulldozer family suggested store/reload for moving data between XMM and GPRs, but Agner said he didn't find that was faster. On other CPUs, movq is lower latency. – Peter Cordes Jul 16 '22 at 18:45
  • @PeterCordes The answer you linked about scoring a CPU instruction's latency says to measure. I'm planning to measure by making a syscall for the current time in nanoseconds at two points, with the move repeated several times in between. Is that a good idea? – Citra Dewi Jul 16 '22 at 19:01
  • @CitraDewi This can work, but you have to make sure that the moves do not execute simultaneously (then you'll just measure throughput). This is tricky to get right even for experts. An easy way to get it kinda right is to always use the output of the previous instruction in the next one. This way, the execution cannot be interleaved. It's not perfect though. – fuz Jul 16 '22 at 19:10
  • @CitraDewi: You need to put the instructions in a loop to measure either throughput or latency. A `clock_gettime` syscall would take thousands of times longer than one or two executions of `movq`, so you need a repeat-loop to make the timed interval long enough to not be lost in measurement overhead and noise. Using `lfence; rdtsc` could give you a measurement of a very short sequence, but it wouldn't be *meaningful* because that's just waiting for retirement, which is different from the impact as part of a realistic use-case. – Peter Cordes Jul 16 '22 at 19:39
  • One other point: if you're moving much data this way, it's usually going to be limited by memory bandwidth anyway, so the form of the instruction won't make much difference. – Jerry Coffin Jul 17 '22 at 18:17
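To summarize the alternatives raised in the comments, here is a sketch in NASM syntax (register choices and the scratch stack slot are mine, for illustration only) of the three ways to get a value into both an XMM register and a GPR:

```nasm
; (a) ALU transfer: a single uop, but with limited throughput
;     (~1/clock even on Alder Lake, per the comments above) and it
;     competes for an ALU port.
movq    xmm0, r8

; (b) two independent loads of the same memory: no GPR<->XMM transfer
;     uop at all, only load-port pressure; Alder Lake can do 3
;     loads per clock.
movsd   xmm1, [r9]
mov     rax,  [r9]

; (c) store/reload through scratch stack space: avoids the transfer
;     uop at the cost of store-forwarding latency (typically a few
;     cycles); AMD's Bulldozer optimization manual once suggested
;     this, though Agner Fog did not find it faster.
mov     [rsp-8], r8
movsd   xmm2, [rsp-8]
```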

0 Answers