Assuming a cache line is 64 bytes:

100 nanoseconds is the figure often quoted for main memory access. Is that figure for reading 1 byte at a time, or for 64 bytes at a time?

Peter Cordes
Samuel Squire

1 Answer


It's for a whole cache line, of course.

The buses / data-paths along the way are at least 8 bytes wide at every point, with the external DDR bus being the narrowest. (Possibly also the interconnect between sockets in a multi-socket system.)

The "critical word" of the cache line might arrive a cycle or two before the rest of it on some CPUs, maybe even 8 on an ancient Pentium-M, but on many recent CPUs the last step between L2 and L1d is a full 64 bytes wide. To make best use of that link (for data going either direction), I assume the L2 superqueue waits to receive a full cache line from the 32-byte ring bus on Intel CPUs, for example.

Skylake, for example, has 12 Line Fill Buffers, so the L1d cache can track misses on up to 12 lines in flight at the same time, for both loads and stores. And the L2 superqueue has a few more entries than that, so it can track some additional requests created by hardware prefetching. Memory-level parallelism (as well as prefetching) is very important in mitigating the high latency of cache misses, especially demand loads that miss in L3 and have to go all the way to DRAM.
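
Memory-level parallelism is easy to demonstrate with a similar sketch that chases several independent chains in one loop (`NCHAINS`, the per-chain size, and the helper names are again invented for illustration). With a single chain every load waits close to the full DRAM latency; with 4 or 8 independent chains the misses can overlap in the fill buffers, so the time per load should drop by several times.

```c
/* Illustrative sketch of memory-level parallelism.  Build with e.g.  gcc -O2 mlp.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64
#define N (32u * 1024 * 1024 / LINE)   /* 32 MiB per chain, larger than typical L3 */
#define NCHAINS 8

struct node {
    struct node *next;
    char pad[LINE - sizeof(struct node *)];
};

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Build one cyclic chain of N cache-line-sized nodes in shuffled order. */
static struct node *build_chain(void) {
    struct node *nodes = aligned_alloc(LINE, (size_t)N * sizeof(struct node));
    size_t *perm = malloc(N * sizeof(size_t));
    for (size_t i = 0; i < N; i++) perm[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        nodes[perm[i]].next = &nodes[perm[(i + 1) % N]];
    struct node *start = &nodes[perm[0]];
    free(perm);
    return start;
}

int main(void) {
    srand(1);
    struct node *p[NCHAINS];
    for (int c = 0; c < NCHAINS; c++) p[c] = build_chain();

    for (int active = 1; active <= NCHAINS; active *= 2) {
        double t0 = now_sec();
        for (size_t i = 0; i < N; i++)
            for (int c = 0; c < active; c++)   /* independent chains: misses can overlap */
                p[c] = p[c]->next;
        double ns = (now_sec() - t0) * 1e9 / ((double)N * active);
        printf("%d chain(s): %.1f ns per load (%p)\n", active, ns, (void *)p[0]);
    }
    return 0;
}
```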


For some actual measurements, see for example https://www.7-cpu.com/cpu/Skylake.html for a Skylake-client i7-6700 with dual-channel DDR4-2400 CL15.

Intel "server" chips, big Xeons, have significantly higher memory latency, enough that it seriously reduces the memory (and L3) bandwidth available to a single core even if the others are idle. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

I haven't heard whether this has improved much with Ice Lake-server or Sapphire Rapids; it was quite bad when Intel first switched to a mesh interconnect (and non-inclusive L3) in Skylake-server.
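
A crude way to compare single-core DRAM bandwidth across machines is a STREAM-style triad loop like the sketch below (this is not the real STREAM benchmark; the array size and scale factor are arbitrary). The reason latency matters for this number is that one core can only keep a limited number of cache lines in flight, so its best-case bandwidth is roughly (outstanding lines × 64 bytes) / latency, regardless of how many DRAM channels the socket has.

```c
/* Rough single-thread bandwidth probe.  Build with e.g.  gcc -O3 -march=native triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (32u * 1024 * 1024)   /* 32 Mi doubles per array = 256 MiB each, well past L3 */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (size_t i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double best = 1e30;
    for (int rep = 0; rep < 5; rep++) {          /* report the best of a few runs */
        double t0 = now_sec();
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];            /* triad: 2 streaming reads + 1 write */
        double t = now_sec() - t0;
        if (t < best) best = t;
    }
    /* Count 3 arrays * 8 bytes per element (ignoring the RFO read of a[]). */
    double gbs = 3.0 * N * sizeof(double) / best / 1e9;
    printf("triad best: %.2f GB/s (a[0]=%f)\n", gbs, a[0]);
    free(a); free(b); free(c);
    return 0;
}
```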

Peter Cordes
  • DDR5 DIMMs have two channels, so the 64-bit data interface is only 32 bits per channel. (This allows a single burst with the new longer length of 16 to provide a 64-byte access.) (I thought High Bandwidth Memory and/or Hybrid Memory Cube used a narrower interface to the processor, but that may be a false memory. I also think HBM used single-direction signalling, which would prefer asymmetric read/write bandwidth. The few pages I looked at only gave information about the internal width and bandwidth.) –  Aug 14 '22 at 15:33
  • Ice Lake Xeon significantly improves single-threaded memory bandwidth when using streaming stores. On Cascade Lake Xeon, streaming stores *reduce* single-thread STREAM Copy & Scale performance by ~17%, with minimal changes to Add & Triad performance. On Ice Lake Xeon, streaming stores *increase* Copy & Scale performance by ~65%, and *increase* Add & Triad performance by ~35%. Ice Lake Xeon is about 12% faster than Cascade Lake Xeon on each test when using normal allocating stores. AMD Milan remains much faster than ICX in these single-thread cases -- 1.8x to 2.2x for each category. – John D McCalpin Aug 16 '22 at 19:30
  • Thanks for that data point on recent hardware. [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) discusses some details of NT (or other no-RFO) vs. regular RFO store protocols, and the difference that can make in bandwidth on server parts. If you want to repost your comment there, or possibly even post a quick answer with that recent data, that might be a good place to share it. That Q&A is one I often link as a canonical about RFOs and NT store advantages / disadvantages, so future readers might be likely to find stuff there. – Peter Cordes Aug 16 '22 at 21:29
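
For anyone wanting to reproduce the streaming-store effect discussed in the comments above, a minimal fill kernel like the following contrasts regular AVX stores, which trigger a read-for-ownership of each destination line, with `_mm256_stream_si256` non-temporal stores, which avoid the RFO and bypass the cache. (Buffer size and loop structure are arbitrary choices here, and this is far cruder than STREAM; requires AVX.)

```c
/* Sketch only.  Build with e.g.  gcc -O2 -mavx ntstore.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BYTES (512u * 1024 * 1024)   /* 512 MiB, far larger than any cache */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    char *buf = aligned_alloc(32, BYTES);
    memset(buf, 0, BYTES);             /* touch the pages so page faults aren't timed */
    __m256i v = _mm256_set1_epi8(0x42);

    /* Regular (allocating) stores: each destination line is read (RFO) first. */
    double t0 = now_sec();
    for (size_t i = 0; i < BYTES; i += 32)
        _mm256_store_si256((__m256i *)(buf + i), v);
    printf("regular stores:   %.2f GB/s\n", BYTES / (now_sec() - t0) / 1e9);

    /* Streaming stores: write-combining, no RFO, bypasses the cache. */
    t0 = now_sec();
    for (size_t i = 0; i < BYTES; i += 32)
        _mm256_stream_si256((__m256i *)(buf + i), v);
    _mm_sfence();                      /* make the NT stores globally visible */
    printf("streaming stores: %.2f GB/s\n", BYTES / (now_sec() - t0) / 1e9);

    printf("%d\n", buf[12345]);        /* keep the buffer live */
    free(buf);
    return 0;
}
```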