Suppose you come across `mov rax, [rsi]` while debugging. How is this instruction actually executed?

In the best case, can the data at the address held in rsi come from the L1 cache? And once the address in rsi has been translated, can the load read its data from disk or from the network rather than from main memory?

What is the worst case for this instruction if we consider the following latency categories?

Latency Comparison Numbers
L1 cache reference
Branch mispredict
L2 cache reference
Mutex lock/unlock
Main memory reference
Compress 1K bytes with Zippy
Send 1K bytes over 1 Gbps network
Read 4K randomly from SSD*
Read 1 MB sequentially from memory
Round trip within same datacenter
Read 1 MB sequentially from SSD*
Disk seek
Read 1 MB sequentially from disk
Send packet CA->Netherlands->CA

initprism
  • Yes, a load can hit in the L1. A load can read directly from a storage device only if the device has a proper interface for that (NVRAM does; ordinary SSDs/HDDs do not, unless you count reading from the MMIO data register as a read operation). In general, a load cannot read from the network or another host; the closest thing I can think of is RDMA (cf. InfiniBand), but that takes more than a single load. The rest of the question doesn't make any sense to me. I wouldn't waste any time on that data; measuring load latency in ns makes no sense. – Margaret Bloom Jan 26 '23 at 20:34
  • A load from a memory-mapped file can page-fault, making the kernel run some FS and driver code to get the data from disk or over the network. But all that work happens in the page-fault handler; the CPU isn't still running the load, which ended with a #PF exception. The load is retried after the page is present and the kernel returns to user space with RIP pointing at it. – Peter Cordes Jan 27 '23 at 02:15
  • The best case for a `mov` from memory is normally an L1d hit: 2 or 3 loads per clock of throughput, with about 5 cycles of latency. But on a few CPUs the best case is even better: reloading a recent store with zero-latency store forwarding, in the special case where the CPU can match the address. For example, Zen 2 has it (but not Zen 3), and so does Ice Lake. https://www.agner.org/forum/viewtopic.php?t=41 – Peter Cordes Jan 27 '23 at 02:20
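To make the page-fault comment concrete, here is a minimal C sketch, assuming a POSIX system such as Linux. It maps a small temporary file and reads it back with a single 8-byte load, counting the page faults reported by `getrusage` around that load. The helper names `page_faults` and `load_from_mapped_file` and the `/tmp/load_demo_XXXXXX` path are hypothetical, chosen only for illustration. On x86-64 a compiler would typically turn the dereference into a `mov` from memory, but that depends on the compiler and flags.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Total page faults (minor + major) this process has taken so far. */
static long page_faults(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return 0;
    return ru.ru_minflt + ru.ru_majflt;
}

/*
 * Hypothetical helper: writes `value` to a temporary file, maps the file,
 * and reads the value back with one 8-byte load.
 * On success returns 0, stores the loaded value in *out and the number of
 * page faults observed around the load in *faults. Returns -1 on error.
 */
static int load_from_mapped_file(uint64_t value, uint64_t *out, long *faults) {
    assert(out != NULL && faults != NULL);

    char path[] = "/tmp/load_demo_XXXXXX";   /* assumed writable location */
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);                            /* file disappears once closed */

    if (write(fd, &value, sizeof value) != (ssize_t)sizeof value) {
        close(fd);
        return -1;
    }

    /* No MAP_POPULATE: the page is not mapped until it is first touched. */
    void *map = mmap(NULL, sizeof value, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                               /* the mapping stays valid */
    if (map == MAP_FAILED) return -1;

    /* volatile keeps the compiler from eliding or reordering the load. */
    volatile const uint64_t *p = map;

    long before = page_faults();
    *out = *p;                               /* the load under discussion */
    *faults = page_faults() - before;

    munmap(map, sizeof value);
    return 0;
}
```

Calling `load_from_mapped_file(42, &v, &n)` should leave `v == 42`. The fault count `n` depends on the system: the freshly written data is normally still in the page cache, so the first touch would usually be a minor fault handled by the kernel, after which the load is retried and completes. Some environments may report no faults at all.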
