Intel architectures have had 64-byte cache lines for a long time. I am curious: if a processor had 32-byte or 16-byte cache lines instead of 64-byte ones, would this improve RAM-to-register data transfer latency? If so, by how much? If not, why not?

Thank you.

1 Answer

Transferring a larger amount of data of course increases the communication time. But the increase is very small due to the way memory is organized, and it does not impact memory-to-register latency.

Memory access operations are done in three steps:

  1. bitline precharge: the row address is sent and the internal bitlines of the memory are precharged (duration tRP)
  2. row access: an internal row of the memory is read and written to internal latches; during that time, the column address is sent (duration tRCD)
  3. column access: the selected columns are read from the row latches and start to be sent to the processor (duration tCL)

Row access is a long operation. A memory is a matrix of cell elements. To increase the capacity of the memory, cells must be made as small as possible. And when reading a row of cells, one has to drive a large, highly capacitive bus that runs along a memory column. The voltage swing is very low, and sense amplifiers are required to detect the small voltage variations.

Once this operation is done, a complete row is held in latches; reading from them is fast, and the data is generally sent in burst mode.

Considering a typical DDR4 memory with a 1 GHz IO clock, tRP/tRCD/tCL are generally 12-15/12-15/10-12 cycles, and the complete access time is around 40 memory cycles (if the processor frequency is 4 GHz, this is ~160 processor cycles). Data is then sent in burst mode, twice per cycle, with 2x64 bits transferred every cycle. So the data transfer adds 4 cycles for 64 bytes and would add only 2 cycles for 32 bytes.

So reducing the cache line from 64 B to 32 B would reduce the transfer time by only ~2/40 = 5%.
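
To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The timing constants are only the approximate DDR4 figures quoted above (real parts differ), and the bus model assumes a 64-bit channel with two transfers per IO cycle:

    # Approximate DDR4 timings from the answer above; treat them as
    # illustrative assumptions, not datasheet values.
    T_RP, T_RCD, T_CL = 14, 14, 11      # memory cycles
    BUS_BYTES_PER_CYCLE = 16            # 64-bit bus, two transfers per IO cycle (DDR)

    def burst_cycles(line_bytes):
        """IO cycles needed to stream one cache line over the bus."""
        return line_bytes // BUS_BYTES_PER_CYCLE

    def row_miss_latency(line_bytes):
        """Precharge + row access + column access + burst, in memory cycles."""
        return T_RP + T_RCD + T_CL + burst_cycles(line_bytes)

    for line in (64, 32):
        print(line, "B line:", row_miss_latency(line), "memory cycles")

    saving = (burst_cycles(64) - burst_cycles(32)) / row_miss_latency(64)
    print(f"saving from halving the line: ~{saving:.0%}")   # ~5%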

If the row address does not change, precharging and reading the memory row are not required and the access time is ~15 memory cycles. In that case, the relative increase in time for transferring 64 B vs 32 B is larger, but still limited: ~2/15, i.e. about 13%.
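
The open-row case can be sketched the same way, with the same assumed tCL and bus width; only the column access and the burst are paid:

    # Open-row (row hit) case: no precharge or row access, so the burst is a
    # larger fraction of the total. Same illustrative constants as above.
    T_CL = 11                   # memory cycles
    BUS_BYTES_PER_CYCLE = 16    # 64-bit bus, two transfers per IO cycle

    def row_hit_latency(line_bytes):
        return T_CL + line_bytes // BUS_BYTES_PER_CYCLE

    extra = row_hit_latency(64) - row_hit_latency(32)
    print(f"extra cycles for 64 B vs 32 B: {extra}")
    print(f"relative increase: {extra}/{row_hit_latency(64)} "
          f"= {extra / row_hit_latency(64):.0%}")   # 2/15, roughly 13%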

Neither evaluation takes into account the extra time required to process a miss in the memory hierarchy, so the actual percentage will be even smaller.

Data can also be sent "critical word first" by the memory. If the processor requires a given word, the address of this word is sent to the memory. Once the row is read, the memory sends this word first, then the other words in the cache line. The cache can therefore serve the processor's request as soon as the first word is received, whatever the line size, so decreasing the line width would have no impact on cache latency. If this feature is used, the memory-to-register time would not change.
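
As an illustration, here is a small Python sketch of that reordering. It uses a simplified sequential wrap-around rather than the exact DDR wrapped-burst order, with an 8-byte word matching a 64-bit bus:

    LINE_BYTES = 64
    WORD_BYTES = 8                      # one 64-bit bus transfer
    WORDS_PER_LINE = LINE_BYTES // WORD_BYTES

    def burst_order(requested_addr):
        """Order in which the words of a line are sent, starting with the
        word that contains the requested (critical) address."""
        first = (requested_addr % LINE_BYTES) // WORD_BYTES
        return [(first + i) % WORDS_PER_LINE for i in range(WORDS_PER_LINE)]

    # A load hitting offset 40 of its line gets word 5 first, then the rest:
    print(burst_order(0x1000 + 40))     # [5, 6, 7, 0, 1, 2, 3, 4]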

In recent processors, however, exchanges between the different cache levels are a full cache line wide, and sending the critical word first does not bring any gain.

Besides that, large line sizes reduce compulsory misses thanks to spatial locality, and reducing the line size would have a negative impact on the cache miss rate.
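
A toy Python sketch can illustrate the spatial-locality point: on a sequential scan, each line is missed exactly once, so halving the line size doubles the number of compulsory misses. Capacity and associativity are deliberately ignored here:

    def compulsory_misses(addresses, line_bytes):
        """Count first-touch (compulsory) misses for a stream of addresses,
        ignoring cache capacity and associativity."""
        seen_lines = set()
        misses = 0
        for addr in addresses:
            line = addr // line_bytes
            if line not in seen_lines:
                seen_lines.add(line)
                misses += 1
        return misses

    # Sequentially read 8-byte words from a 64 KiB array:
    addrs = range(0, 64 * 1024, 8)
    for line_bytes in (16, 32, 64):
        print(line_bytes, "B lines:", compulsory_misses(addrs, line_bytes), "misses")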

Last, using larger cache lines increases the data transfer rate between cache and memory.

The only negative aspects of large cache lines (besides the small transfer time increase) are that the number of lines in the cache is reduced and conflict misses may increase. But with the high associativity of modern caches, this effect is limited.

Alain Merigot
  • *reduce the transfer time by 5%* But that's not the total cache-miss latency: you're leaving out time for the CPU to figure out that it was an L3 cache miss, getting the request down the memory hierarchy to the memory controller and back up again. – Peter Cordes Apr 12 '19 at 09:55
  • Re: early restart = resuming execution when the critical word arrives, and critical word first: this is great in CPUs with internal busses much narrower than a cache line. But on modern Intel CPUs (Haswell and later, where the bus from L2 <-> L1d cache is 64 bytes wide), there's probably no point to critical-word-first. The entire line arrives in the same cycle. BeeOnRope and I discussed this not too long ago, but I can't find it with google. :/ – Peter Cordes Apr 12 '19 at 10:12
  • Our conclusion was that even apart from a full-line bus, HW prefetch and so on probably mean there's often no way to do CW first in a modern high-end out-of-order execution CPU. You can have multiple outstanding misses on the same line, too, and the "critical" one might be the one whose address was ready 2nd. So probably not worth the complexity for many high-end CPUs, where a whole line only takes a couple of internal cycles to send over internal busses once it finally arrives at the memory controller from DRAM. (And the latency * bandwidth product is high for the memory subsystem as a whole.) – Peter Cordes Apr 12 '19 at 10:15
  • All the given times are clearly approximate. You are right concerning the cache-miss extra time, and the actual latency will be larger by maybe 40 processor cycles (i.e. ~10 memory cycles), but it is almost impossible to get precise information on these timings. This does not change the overall idea on the small impact of large cache lines. Concerning the critical word issue, all internal transfers are line wide, so it is indeed likely that it is no longer used. – Alain Merigot Apr 12 '19 at 10:26
  • Even on Skylake (client), https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Individual_Core shows the ring bus between cores is only 32 bytes wide, so L3<->L2 transfers take 2 cycles to send a line. And yeah, load-use latency for a cache miss is dominated by DRAM latency, but the ~42 core clocks is not totally insignificant https://www.7-cpu.com/cpu/Skylake.html. And it's worse on many-core Xeons with more hops on the ring bus before you get to the memory controller. (Measurably worse latency, and single-thread *bandwidth* is worse, too.) Nice edit, though. – Peter Cordes Apr 12 '19 at 10:43