35

I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?

How many cycles does an L1 cache hit take? How does it compare to register access?

Peter Cordes
user541686
  • 1
    It varies by processor, but I don't know of any where it's *quite* as fast as a register -- around 1 to 5 clocks slower is fairly typical. – Jerry Coffin Apr 23 '12 at 03:15
  • 3
    I don't know of any architectures where L1 has single-cycle latency. Also, I don't know of any x86 architectures where register access has a measurable latency in itself (some latency may be perceived due to other factors). – harold Apr 24 '12 at 12:07
  • 1
    See http://www.7-cpu.com/cpu/Haswell.html: some per-cache and per-TLB latency numbers, and some experimental numbers. See also [Agner Fog's microarch pdf](http://agner.org/optimize/), and other links in the [x86 tag wiki](http://stackoverflow.com/tags/x86/info). Haswell's L1 load-use latency is 4 cycles, which is typical of modern x86 CPUs. Store-reload latency is 5 cycles, and unrelated to cache hit or miss (it's store-forwarding, not cache). As harold says, register access is 0 cycles (e.g. `inc eax` has 1 cycle latency, `inc [mem]` has 6 cycle latency (ALU + store-forwarding)). – Peter Cordes Aug 24 '16 at 21:52

4 Answers

47

Here's a great article on the subject:

http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1

To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)

PS:

The specifics will vary, but this link has some good ballpark figures:

Approximate cost to access various caches and main memory?

Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns 
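
Since these figures mix cycles and nanoseconds (and the table further down is in ns), it helps to remember that cycles ≈ latency_ns × clock_GHz. Here's a tiny conversion sketch (not from either quoted source; the clock value is a placeholder to replace with your own CPU's frequency):

/* Small ns <-> cycles conversion helper (illustrative only).
 * cycles = latency_ns * clock_GHz, since 1 GHz = 1 cycle per ns. */
#include <stdio.h>

int main(void) {
    const double clock_ghz = 3.0;                  /* placeholder: set to your CPU's clock */
    const double samples_ns[] = { 30.0, 100.0 };   /* local / remote DRAM rows above */
    const char  *labels[]     = { "Local DRAM", "Remote DRAM" };

    for (int i = 0; i < 2; i++)
        printf("%-12s ~%5.1f ns  =  ~%.0f cycles at %.1f GHz\n",
               labels[i], samples_ns[i], samples_ns[i] * clock_ghz, clock_ghz);
    return 0;
}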

PPS:

These figures represent much older, slower CPUs, but the ratios basically hold:

http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2

Level                     Access Time   Typical Size   Technology   Managed By
-----                     -----------   ------------   ----------   ----------
Registers                 1-3 ns        ≤ 1 KB         Custom CMOS  Compiler
Level 1 Cache (on-chip)   2-8 ns        8 KB - 128 KB  SRAM         Hardware
Level 2 Cache (off-chip)  5-12 ns       0.5 MB - 8 MB  SRAM         Hardware
Main Memory               10-60 ns      64 MB - 1 GB   DRAM         Operating System
Hard Disk                 3M - 10M ns   20 - 100 GB    Magnetic     Operating System/User
paulsm4
  • 5
    How is it possible that accessing L3 cache can take 100-300 cycles, while local DRAM access only takes about 120 cycles? Does that mean that L3 cache can be more than two times slower than DRAM, which is used in main memory? – user2316602 Aug 10 '16 at 17:52
  • @user2316602: seems bogus to me, too, unless that table row is supposed to be for the L3 cache of a CPU in a different socket. (It's a Nehalem Xeon system, so main memory and L3 are NUMA.) – Peter Cordes Aug 24 '16 at 21:32
  • L3 (and DRAM) latency is lower on Intel CPUs with fewer cores, like dual or quad-core i7: fewer hops on the ring bus and simpler uncore. See https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug. The biggest Xeons have significantly worse L3 hit latency than this table for Woodcrest. – Peter Cordes Mar 26 '18 at 12:42
14

Throughput and latency are different things. You can't just add up cycle costs. For throughput, see Load/stores per cycle for recent CPU architecture generations - 2 loads per clock throughput for most modern microarchitectures. And see How can cache be that fast? for microarchitectural details of load/store execution units, including showing load / store buffers which limit how much memory-level parallelism they can track. The rest of this answer will focus only on latency, which is relevant for workloads that involve pointer-chasing (like linked lists and trees), and how much latency out-of-order exec needs to hide. (L3 Cache misses are usually too long to fully hide.)

Single-cycle cache latency used to be a thing on simple in-order pipelines at lower clock speeds (so each cycle was more nanoseconds), especially with simpler caches (smaller, not as associative, and with a smaller TLB for caches that weren't purely virtually addressed.) e.g. the classic 5-stage RISC pipeline like MIPS I assumes 1 cycle for memory access on a cache hit, with address calculation in EX and memory access in a single MEM pipeline stage, before WB.

Modern high-performance CPUs divide the pipeline up into more stages, allowing each cycle to be shorter. This lets simple instructions like add / or / and run really fast, still 1 cycle latency but at high clock speed.


For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.


Intel Haswell's L1 load-use latency is 4 cycles for pointer-chasing, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax] can run in a loop, with a pointer that points to itself. (Or for a linked list that hits in cache, easy to microbench with a closed loop). See also Is there a penalty when base+offset is in a different page than the base? That 4-cycle latency special case only applies if the pointer comes directly from another load, otherwise it's 5 cycles.
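
A minimal sketch of that closed-loop microbenchmark (GNU C with inline asm on x86-64, using the `__rdtsc()` intrinsic; an illustration rather than a polished benchmark harness):

/* Chase a pointer that points to itself: each load's address is the previous
 * load's result, so the loop runs at one L1d load-use latency per iteration.
 * Build with: gcc -O2 chase.c   (x86-64, GNU C assumed) */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void) {
    void *cell = &cell;                /* self-pointing cell: *(&cell) == &cell */
    void *p = &cell;
    const uint64_t iters = 200000000;

    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        asm volatile("mov (%0), %0" : "+r"(p));   /* the mov rax, [rax] chain */
    uint64_t dt = __rdtsc() - t0;

    /* __rdtsc() counts reference cycles; with Turbo off that is ~= core cycles. */
    printf("~%.2f cycles per dependent load (p=%p)\n", (double)dt / iters, p);
    return 0;
}

The loop overhead (the increment/compare/branch) runs in parallel with the load chain, so on a Haswell/Skylake-like core this should print roughly 4 to 5 cycles per load, matching the special case described above.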

Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.


Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, reading from the store buffer for store data that hasn't yet committed to L1d cache).

As harold commented, register access is 0 cycles. So, for example:

  • inc eax has 1 cycle latency (just the ALU operation)
  • add dword [mem], 1 has 6 cycle latency until a load from dword [mem] will be ready (ALU + store-forwarding); e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles (see the sketch after this list).
  • mov rax, [rsi] has 4 cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency.)
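
A rough sketch of that register-vs-memory contrast (again GNU C inline asm, x86-64 assumed; the cycle numbers quoted are Haswell's, other CPUs differ):

/* Keep a counter in a register vs. in memory.  The memory version has a
 * loop-carried store-forwarding + ALU dependency (~6 cycles/iter on Haswell);
 * the register version is a 1-cycle ALU dependency.  Build: gcc -O2 ctr.c */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define ITERS 200000000UL

static double reg_counter(void) {
    uint64_t x = 0, t0 = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++)
        asm volatile("add $1, %0" : "+r"(x));     /* like inc eax: register only */
    return (double)(__rdtsc() - t0) / ITERS;
}

static double mem_counter(void) {
    static uint64_t x;                             /* counter lives in memory */
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++)
        asm volatile("addq $1, %0" : "+m"(x));     /* like add dword [mem], 1 */
    return (double)(__rdtsc() - t0) / ITERS;
}

int main(void) {
    printf("register counter: ~%.2f cycles/iter\n", reg_counter());
    printf("memory   counter: ~%.2f cycles/iter\n", mem_counter());
    return 0;
}

With Turbo off (so reference cycles ≈ core cycles), expect roughly 1 cycle per iteration for the register counter and roughly 6 for the memory counter on Haswell.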

http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).

Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).

  • L1 Data cache = 32 KB, 64 B/line, 8-WAY.

  • L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.

  • L2 cache = 256 KB, 64 B/line, 8-WAY

  • L3 cache = 8 MB, 64 B/line

  • L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])

  • L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).

  • L2 Cache Latency = 12 cycles

  • L3 Cache Latency = 36 cycles

  • RAM Latency = 36 cycles + 57 ns

The top-level benchmark page is http://www.7-cpu.com/utils.html; it still doesn't really explain what the different test sizes mean, but the code is available. The test results include Skylake, which is nearly the same as Haswell in this test.
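
For reference, this is roughly the kind of test such numbers come from (a sketch, not the actual 7-cpu code; x86-64 GNU C and `__rdtsc()` assumed): chase pointers through a randomly shuffled cycle of a given working-set size, and each cache level shows up as a step in cycles per load as the size crosses 32 KB, 256 KB, and 8 MB.

/* Pointer-chase through a random cycle covering `bytes` of memory, so the
 * average load latency reflects whichever cache level that working set fits in.
 * Build: gcc -O2 latsweep.c */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>

static double chase(size_t bytes) {
    size_t n = bytes / sizeof(void *);
    void **arr = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {           /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)                 /* link one big cycle */
        arr[idx[i]] = &arr[idx[(i + 1) % n]];

    void *p = &arr[idx[0]];
    const uint64_t iters = 20000000;
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        p = *(void **)p;                           /* dependent loads */
    double cycles = (double)(__rdtsc() - t0) / iters;

    cycles += (p == NULL);                         /* use p so the chase isn't optimized away */
    free(arr); free(idx);
    return cycles;
}

int main(void) {
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        printf("%6zu KB working set: ~%.1f cycles/load\n", kb, chase(kb * 1024));
    return 0;
}

For the biggest sizes, TLB misses add to the per-load numbers too, which is part of what the 7-cpu TLB figures are about.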

@paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.

Peter Cordes
  • For some reason, I never see the L1i latency value on these sites. It was 2 cycles on P6 for a hit/ITLB hit, is it still 2 cycles on later microarchitectures? I hope so. – Lewis Kelsey Jan 31 '21 at 04:43
  • @LewisKelsey: Good question, but IDK. I doubt it's stayed that low latency with clock frequencies climbing the way they have, and with 32KiB / 8-way size (same as pre-IceLake L1d). Buffering between stages, and good branch prediction, can help hide bubbles even in high-throughput code. Also, the hottest code typically runs from the uop cache, meaning L1i hit latency doesn't matter in many cases. I'd expect 4 or 5 cycle latency, perhaps 3 if it helps that it can be read-only single-ported, and not need to support unaligned loads. And not need to probe the store buffer. – Peter Cordes Jan 31 '21 at 04:51
  • @LewisKelsey: Without a statement from the vendor, it's pretty hard to measure. Very hard to separate other length-of-pipeline / re-steer effects with actual L1i latency. In practice branch-miss recovery time is what you can measure, for uop-cache hit vs. uop-cache miss + L1i hit. – Peter Cordes Jan 31 '21 at 04:52
  • Actually, a late BPU clear causes a 3-cycle bubble on Westmere, and this appears to happen at the ILD stage. That suggests that if it can re-steer a new IP on the high edge of cycle 5 into the low edge of the first cycle, with a 3-cycle bubble in between (between cycles 1 and 5), then there is room for 4 cycles before the ILD, so maybe it is 4 for a regular hit after all. I can't find any diagrams for the cache-lookup pipestages, but maybe some of those original clocks are now split into 2 due to the faster clock speeds. – Lewis Kelsey Jan 31 '21 at 05:11
1

If I remember correctly it's about 1-2 clock cycles, but this is an estimate and newer caches may be faster. This is out of a computer architecture book I have, and it's information for AMD, so Intel may be slightly different, but I would bound it between 5 and 15 clock cycles, which seems like a good estimate to me.

EDIT: Whoops, L2 is 10 cycles with TAG access; L1 takes 1 to 2 cycles, my mistake :\

Jesus Ramos
  • Just checking, you're talking about a *hit* and not a *miss*, right? – user541686 Apr 23 '12 at 03:16
  • Yes, TAG access takes 2 cycles alone I believe, and the rest of the time is from cache access and loading. – Jesus Ramos Apr 23 '12 at 03:17
  • @Mehrdad I gave info for L2, my bad; updated with correct info. – Jesus Ramos Apr 23 '12 at 03:20
  • I kind of suspected something was weird. :) Thanks. – user541686 Apr 23 '12 at 04:09
  • The faster the CPU is clocked, the more cycles it takes for the same amount of real time. Modern CPUs have an L1 load-use latency of more like 4 cycles (Intel Haswell). (i.e. cycles/iteration for a loop containing `mov eax, [eax]`, with a pointer that points to itself.) See the top of http://www.7-cpu.com/cpu/Haswell.html for some numbers. – Peter Cordes Aug 24 '16 at 21:46
1

Actually the cost of an L1 cache hit is almost the same as the cost of register access. It was surprising to me, but this is true, at least for my processor (Athlon 64). Some time ago I wrote a simple test application to benchmark the efficiency of access to shared data in a multiprocessor system. The application body is simply a memory variable being incremented for a predefined period of time. To make a comparison, I benchmarked a non-shared variable first. During that activity I captured the result, but then, while disassembling the application, I found that the compiler had defeated my expectations and applied an unwanted optimisation to my code: it just put the variable in a CPU register and incremented it iteratively there, without any memory access. The real surprise came after I forced the compiler to use an in-memory variable instead of a register variable. With the updated application I achieved almost the same benchmarking results. The performance degradation was really negligible (~1-2%) and looks like it was related to some side effect.

As a result:

1) I think you can consider the L1 cache as an unmanaged pool of processor registers.

2) There is no sense in applying brutal assembly optimisation that forces the compiler to keep frequently accessed data in processor registers. If the data really is frequently accessed, it will live in the L1 cache, and due to this it will have the same access cost as a processor register.
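
A rough reconstruction of the experiment described above (a sketch, not the original application; GNU C and POSIX `clock_gettime` assumed, with `volatile` forcing the in-memory variant and an empty `asm` keeping the register variant from being collapsed into a single addition by the optimizer):

/* Count how many increments finish in a fixed interval, once with the counter
 * kept in a register and once with a volatile in-memory counter.
 * Build: gcc -O2 inc_bench.c */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    const double interval = 1.0;                  /* seconds per variant */

    uint64_t in_reg = 0;
    for (double end = now() + interval; now() < end; )
        for (int i = 0; i < 100000; i++) {
            in_reg++;
            asm volatile("" : "+r"(in_reg));      /* force a real per-iteration add */
        }

    volatile uint64_t in_mem = 0;                 /* every ++ is a load + add + store */
    for (double end = now() + interval; now() < end; )
        for (int i = 0; i < 100000; i++)
            in_mem++;

    printf("register counter: %llu increments in %.1f s\n",
           (unsigned long long)in_reg, interval);
    printf("memory   counter: %llu increments in %.1f s\n",
           (unsigned long long)in_mem, interval);
    return 0;
}

Comparing the two counts, and looking at the generated asm with gcc -S or objdump -d as described above, shows whether the memory round trip is actually being paid on every iteration on your CPU.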

ZarathustrA
  • Your benchmark was wrong, then, or bottlenecked on something else. `inc [mem]` has 6c latency on Intel Haswell, and similar on AMD. `inc eax` has 1 cycle latency on all modern x86 CPUs. That's store-forwarding latency, not L1 latency. L1 load-use latency is more like 4 cycles. See Agner Fog's microarch pdf, and other links on the [x86 tag wiki](http://stackoverflow.com/tags/x86/info). – Peter Cordes Aug 24 '16 at 21:38
  • @peter-cordes: Not necessarily. It would be wrong if I wanted to measure the latency of instruction execution (how many cycles a particular instruction spends in the CPU pipeline before retirement). However, I aimed to identify how significant the difference in performance penalty between register-based and memory-based variables is for the execution of regular application code. A superscalar, pipelined CPU with advanced branch prediction amortizes the differences between instructions with different latencies almost entirely. – ZarathustrA Sep 02 '21 at 01:08
  • Furthermore, I can speculate that memory-touching instructions have more latency than their register-based counterparts due to more complex decoding and the involvement of address-generation units in instruction processing, rather than due to the cache access itself. – ZarathustrA Sep 02 '21 at 01:08
  • Instruction latency is how long before a dependent instruction can use the result. That doesn't mean waiting until retirement, because *all* instructions are speculative in an out-of-order exec CPU. In a long-running loop, the CPU can't hide the latency of loop-carried dependency chains (i.e. that connect across iterations). e.g. [Why does re-initializing a register inside an unrolled ADD loop make it run faster even with more instructions inside the loop?](https://stackoverflow.com/q/58884297) – Peter Cordes Sep 02 '21 at 01:28
  • 1
    `looptop:` / `inc [mem]` / `dec ecx`/`jnz looptop` will run at about 1 iteration per 6 cycles, bottlenecked on store-forwarding on most recent x86. But with `inc edx` (and *no* store/reload bottlenecks in the whole loop), it can run 1/clock. Perhaps you used inline asm inside a loop, and didn't enable optimization, so the compiler created a loop around your asm that bottlenecked on a memory-destination increment. In that case yeah, you can't gain much by avoiding memory because the bottleneck is still there. – Peter Cordes Sep 02 '21 at 01:30
  • Another example: [Why dependency in a loop iteration can't be executed together with the previous one](https://stackoverflow.com/q/54047936). Things can get interesting on real CPUs where store-forwarding latency isn't constant: [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) Or when Sandybridge-family CPUs have to replay load uops: [Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?](https://stackoverflow.com/q/54084992) – Peter Cordes Sep 02 '21 at 01:33
  • See also https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/ re: finding data dependencies in loops. Also [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) – Peter Cordes Sep 02 '21 at 01:40
  • All of this is why compilers aggressively keep variables in registers, especially in loops, and why un-optimized code (gcc/clang -O0, MSVC debug mode) has *different* bottlenecks than normally optimized code. – Peter Cordes Sep 02 '21 at 01:45
  • Found a concrete example of a case where an extra store/reload as part of a loop carried dependency chain made the loop slower - 9 cycles per iteration vs. 3 for `addsd` on Haswell vs. `addsd` + store/reload: [Visual Studio C++ compiler generates 3x slower code when changing completely unrelated code](https://stackoverflow.com/q/38768000) – Peter Cordes Sep 02 '21 at 01:46
  • @peter-cordes: `looptop:` / `inc [mem]` / `dec ecx/jnz looptop` is almost the same loop that I had. And running this loop took almost the same time as `looptop:` / `inc eax` / `dec ecx/jnz looptop`. Just a few percent of difference over 4 billion iterations, on a real machine. I agree that on another CPU I could observe other effects, because all such stuff heavily depends on the CPU microarchitecture. – ZarathustrA Sep 02 '21 at 01:49
  • But I suggest that modern CPUs, due to pipelining, superscalar execution, branch prediction, speculative execution, and register renaming, will effectively amortize the difference between memory-based and register-based instructions for most 'average' code. – ZarathustrA Sep 02 '21 at 01:49
  • What CPU were you testing on? Was it Zen or Ice Lake? Those have a new feature to do zero-latency store forwarding in special cases, at least when the address is ready ahead of the store-data. (https://www.agner.org/forum/viewtopic.php?t=41 / https://en.wikichip.org/wiki/amd/microarchitectures/zen#Stack_Engine). That's a special case, and most CPUs don't have that (yet); I was forgetting about that possibility in my earlier comments. On new hybrid CPUs (like Alder Lake) with low-power "efficiency" cores, those cores won't have it either. Most x86 servers are still Intel Skylake or older. – Peter Cordes Sep 02 '21 at 02:32
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/236666/discussion-between-zarathustra-and-peter-cordes). – ZarathustrA Sep 02 '21 at 02:33
  • *amortize the difference* - The key thing is the length of dependency chains. If there's a long dependency chain across loop iterations, no amount of independent work can speed that up, you can just fill those otherwise-idle cycles. (e.g. by unrolling with multiple accumulators.) And yes, of course I'm talking about modern superscalar pipelined out-of-order execution CPUs, such as Intel Haswell (https://www.realworldtech.com/haswell-cpu/) and Skylake. – Peter Cordes Sep 02 '21 at 02:36