
We know that direct-mapped caches are better than set-associative caches in terms of cache hit time, since there is no search involved for a particular tag. On the other hand, set-associative caches usually show a better hit rate than direct-mapped caches.

I read that modern processors try to combine the benefits of both by using a technique called way prediction: they predict the line of the given set where a hit is most likely to happen and search only that line. If that attempt results in a miss, they fall back to the normal set-associative search across all the cache lines of the set.

I want to understand how this way prediction works. How is the latency of the prediction hardware/logic smaller than the search latency of the complete set?
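
In rough terms, the flow being asked about looks like the following toy C model (the struct, field, and function names are illustrative, not taken from any real CPU):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_WAYS 8

/* Toy software model of a way-predicted lookup; real hardware does this
 * with parallel logic, not a loop. */
struct cache_set {
    uint64_t tag[NUM_WAYS];
    bool     valid[NUM_WAYS];
    int      predicted_way;                    /* per-set way prediction */
};

/* Returns the hitting way, or -1 on a miss. */
static int lookup(struct cache_set *set, uint64_t tag)
{
    int p = set->predicted_way;
    if (set->valid[p] && set->tag[p] == tag)
        return p;                              /* fast path: only one way probed */

    for (int w = 0; w < NUM_WAYS; w++) {       /* slow path: normal set-associative search */
        if (w != p && set->valid[w] && set->tag[w] == tag) {
            set->predicted_way = w;            /* retrain the predictor */
            return w;                          /* hit, but with extra latency */
        }
    }
    return -1;                                 /* genuine miss */
}
```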

  • I think way-prediction is more about reducing power by not fetching all tags and data (for that set) in parallel like a "normal" L1 cache would. A normal L1 cache without way prediction typically compares all tags in parallel, using the result to mux the data from that way to the output. This has only a small amount of latency so way-prediction is usually not done purely for latency reasons, AFAIK. Besides power, I think it can help in cases where the tag isn't ready early, but I forget the details and have never fully grokked way prediction. (like how you'd build it) – Peter Cordes Oct 03 '20 at 10:58
  • @Peter Cordes Do you mean the modern processors have hardware resources to perform **all the n searches required for the n-way set associative cache in parallel** and hence there are no latency implications? And it tries to save power by not using all the search hardware/circuit available to them? – jhagk Oct 03 '20 at 11:15
  • Yeah, comparing for exact equality is pretty cheap (just vertical XOR => multi-input horizontal OR and check for 0 meaning no mismatches). It's easier to do the tag compares in parallel than to shuffle them 1 at a time into a single comparator, especially given the amount of extra logic you'd need to do that. And you certainly want to fetch all the tags for a set with one access to cache, not keep accessing it repeatedly. e.g. anything online about caches describes number of comparators = ways https://courses.cs.washington.edu/courses/cse378/07au/lectures/L16-More-Cache-Organizations.pdf – Peter Cordes Oct 03 '20 at 11:33
  • The only question is whether you fetch the data (from the given cache offset) for each set in parallel with the tags, or whether you wait until you know which way (if any, could be a miss) based on the tag compares. Remember, hardware is naturally parallel, there's no inherent serial model of execution like there is with software, unless you're building an ancient CPU microcoded the way a 6502 or Z80 is. Also somewhat related: [VIPT Cache: Connection between TLB & Cache?](https://stackoverflow.com/q/46480015) describes more about the details of tag + – Peter Cordes Oct 03 '20 at 11:35
  • In my first comment, I should have said all caches always compare tags in parallel. Not just L1 caches. – Peter Cordes Oct 03 '20 at 11:44

1 Answer


The way prediction mechanisms for AMD's Bulldozer and Ryzen families are µtag-based and are documented in "Take A Way: Exploring the Security Implications of AMD's Cache Way Predictors" (Moritz Lipp et al., 2020, PDF).

µtag-based way prediction matches a hash of the virtual address rather than the full virtual address. Not only does this avoid the address translation overhead, like a virtually tagged cache, but the smaller storage also allows the prediction array to be accessed with lower latency and the tag to be checked with slightly lower latency. "Take A Way" reverse engineered that both AMD's Bulldozer family and Ryzen family use bits 12 through 27 for the hash function and that a single xor (⊕) layer is used, which reduces latency. The Bulldozer family used 12⊕21, 13⊕22, 14⊕23, 15⊕24, 16⊕25, 17⊕26, 18⊕27; the Ryzen family used 12⊕27, 13⊕26, 14⊕25, 15⊕20, 16⊕21, 17⊕22, 18⊕23, 19⊕24.
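
For concreteness, here is a small C sketch of those two hash functions. The bit pairs are the ones reverse engineered in "Take A Way"; the function names and the packing of the result into an integer are illustrative choices of mine.

```c
#include <stdint.h>

/* Bit i of the virtual address. */
static inline uint32_t vbit(uint64_t va, int i) { return (uint32_t)((va >> i) & 1); }

/* Bulldozer-family µtag hash: seven output bits, bit i = VA[12+i] ⊕ VA[21+i]. */
static uint32_t utag_bulldozer(uint64_t va)
{
    uint32_t h = 0;
    for (int i = 0; i < 7; i++)
        h |= (vbit(va, 12 + i) ^ vbit(va, 21 + i)) << i;
    return h;
}

/* Ryzen-family µtag hash: eight output bits,
 * 12⊕27, 13⊕26, 14⊕25, 15⊕20, 16⊕21, 17⊕22, 18⊕23, 19⊕24. */
static uint32_t utag_ryzen(uint64_t va)
{
    static const int pairs[8][2] = {
        {12, 27}, {13, 26}, {14, 25}, {15, 20},
        {16, 21}, {17, 22}, {18, 23}, {19, 24},
    };
    uint32_t h = 0;
    for (int i = 0; i < 8; i++)
        h |= (vbit(va, pairs[i][0]) ^ vbit(va, pairs[i][1])) << i;
    return h;
}
```

Two addresses that differ only in both bits of a single xor pair produce the same µtag; such collisions are exactly the aliasing that the paper exploits.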

Two aspects of these µtag hash functions are worth noting. First, by using less significant bits rather than the full 48 valid virtual address bits, all of the bits used in the hash function are available earlier because of reduced carry propagation delay (address generation involves an addition, and although high-performance adders have log(n) delay, the less significant bits will still be available earlier). (This effect also means that the twelve least significant bits used to determine the cache set are available even earlier, so the predictor table can be indexed before the µtag has been computed.) Second, in the Ryzen family, the typically least variable (most significant) bits are xored with the typically most variable (least significant) bits for three bits of the hash; this should reduce the probability of false matches. False matches are handled by replacing the matching line rather than using the ordinary (LRU-oriented) replacement policy; this will usually result in a higher miss rate.
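
A minimal sketch of that replacement behaviour, assuming a simple per-set table of stored µtags (the struct and field names below are hypothetical): when a line is brought in and its µtag collides with a way already present (the false-match case), that way is the one replaced, so µtags stay unique within a set.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_WAYS 8

/* Illustrative per-set predictor state; field names are mine, not AMD's. */
struct set_state {
    bool     valid[NUM_WAYS];
    uint32_t utag[NUM_WAYS];   /* stored µtag per way */
    uint8_t  lru_victim_way;   /* stand-in for whatever the real LRU logic picks */
};

/* Pick the way to fill for an incoming line whose µtag is req_utag. */
static int choose_victim(const struct set_state *s, uint32_t req_utag)
{
    /* If some way's µtag already equals the new line's µtag, evict that way
     * rather than the LRU choice, keeping µtags unique within the set. */
    for (int w = 0; w < NUM_WAYS; w++)
        if (s->valid[w] && s->utag[w] == req_utag)
            return w;
    return s->lru_victim_way;  /* normal case: ordinary (LRU-oriented) replacement */
}
```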

(Recent Intel x86 processors are also known to use µtag-based way prediction.)

Other Way Prediction Examples

Way prediction is not a new technique. POWER6 used a µtag predictor with the 11-bit tags being [14:17].([16:23]⊕[24:31]) for a 64 KiB 8-way cache with 128 B cache lines ("IBM POWER6 microarchitecture", H.Q. Le et al., 2007). One valid bit per hardware thread was also included to avoid thrashing for homonyms (effective address matches for different address spaces). As with Ryzen, there is clearly a recognition that the least significant bits vary more frequently, so the least significant bits are the ones xored with other bits.

The Pentium 4 also used a µtag predictor. According to "The Microarchitecture of the Intel® Pentium® 4 Processor on 90nm Technology" (Darrell Boggs et al., 2004), the 90nm implementation "significantly increased the size of the partial address match from previous implementations, thus reducing the number of false aliasing cases". Details appear not to have been published.

The MIPS R10000 used a simple MRU-based way predictor for its off-chip two-way associative L2 cache. 8 Ki single-bit prediction entries were provided to indicate the most recently used cache block of each set. If more than 8 Ki sets were provided (up to 128 Ki sets were supported for a 16 MiB L2 cache with 64 B blocks), different sets would use the same prediction bit (predictor aliasing). This way prediction was used to reduce pin count: only one tag would be read at a time, along with part of the data block from only one way. The alternatives would be a direct-mapped cache (HP PA-RISC used large off-chip, direct-mapped L1 caches) or specialized (more expensive) chips for handling tag comparison (the MIPS R8000 used special tag SRAMs that included tag comparison logic and used the comparison result to address the ordinary SRAMs holding the data).
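
A toy model of that single-bit MRU predictor, including the aliasing that occurs when the cache has more than 8 Ki sets (the function names are mine, not MIPS's):

```c
#include <stdint.h>

#define PRED_ENTRIES (8 * 1024)          /* 8 Ki single-bit entries */

static uint8_t mru_bit[PRED_ENTRIES];    /* 0 or 1: which way of the 2-way set to try first */

static unsigned pred_index(unsigned set_index)
{
    return set_index % PRED_ENTRIES;     /* different sets alias when sets > 8 Ki */
}

/* Which way to read first for this set. */
static int predict_way(unsigned set_index)
{
    return mru_bit[pred_index(set_index)];
}

/* After a hit (or fill) in `way`, remember it as most recently used. */
static void train(unsigned set_index, int way)
{
    mru_bit[pred_index(set_index)] = (uint8_t)way;
}
```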

The Alpha 21264 instruction cache used a set and way predictor, which could be viewed as a variation of a branch target buffer. For each aligned chunk of four 4-byte instructions, a prediction of the next line (index) and way was included. If a chunk of instructions included a branch that was taken last time it was executed, that branch's target line and way would be the prediction for that line. Control flow instructions with variable targets (including call returns) and branches that change whether they are taken or not would result in mispredictions, but this predictor's accuracy was usually high.
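
A rough sketch of that line-and-way predictor follows. In the 21264 the prediction is stored alongside each fetch chunk in the instruction cache itself; a separate table is used here only to keep the illustration compact, and all sizes and names are assumptions.

```c
#include <stdint.h>

#define FETCH_CHUNKS 1024                 /* hypothetical number of tracked chunks */

/* One prediction per aligned block of four 4-byte instructions. */
struct next_fetch {
    uint16_t next_index;                  /* predicted next I-cache line (set index) */
    uint8_t  next_way;                    /* predicted way within that set */
};

static struct next_fetch pred[FETCH_CHUNKS];

/* Fetch follows the prediction stored for the current chunk... */
static struct next_fetch predict_next(unsigned chunk)
{
    return pred[chunk % FETCH_CHUNKS];
}

/* ...and a taken branch (or a misprediction) retrains it with where the
 * instruction stream actually went next. */
static void train_next(unsigned chunk, uint16_t index, uint8_t way)
{
    pred[chunk % FETCH_CHUNKS].next_index = index;
    pred[chunk % FETCH_CHUNKS].next_way   = way;
}
```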

Latency and Power Considerations

Modern high-performance processors primarily use way prediction to reduce access energy while maintaining fast access. With support for 32-byte cache accesses (e.g., for AVX) and fast unaligned loads (which effectively doubles the access size), the energy difference between reading eight ways of data in parallel and (usually) reading only one way of data is substantial. The savings in tag read and compare energy are somewhat reduced by the need to read and compare µtags. (Note that relaxing the latency constraint on the TLB, since hit confirmation using physical tags and permission checks can occur after the predicted-way data is already being used by execution units, can also be exploited to reduce access energy or increase TLB capacity.)
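
As a back-of-the-envelope illustration of the data-array part of that argument (the associativity and access size below are assumptions for illustration, not measurements):

```c
#include <stdio.h>

int main(void)
{
    const int ways         = 8;   /* assumed associativity */
    const int access_bytes = 32;  /* e.g., one 32-byte (AVX-sized) load */

    int parallel_read  = ways * access_bytes;  /* read every way, select afterwards */
    int predicted_read = 1 * access_bytes;     /* read only the predicted way */

    /* 256 vs 32 bytes per load: an 8x difference in data-array reads, before
     * counting the extra µtag read/compare energy on the other side. */
    printf("data bytes read per load: %d (parallel) vs %d (way-predicted)\n",
           parallel_read, predicted_read);
    return 0;
}
```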

Direct-mapped caches gain a latency advantage from not having to select the correct way before forwarding the data to execution units. Selecting the correct way includes tag comparison and the multiplexor selection itself. However, if the way determination (or prediction) latency is less than the data access latency, the only added latency for set associativity is the pass-through latency of the "warmed-up" multiplexors. Since tag arrays are much smaller than data arrays, their access latency is much smaller, so it is easier (especially with virtual address tags) to determine the way a little before the data itself is available. (In earlier processors, smaller cache blocks — tag array size closer to data array size — and relatively lower wire delay compared to logic delay would make completing way determination before data availability more difficult and modestly increase the impact of selection delay.)
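
A small illustration of that timing argument, with made-up latency numbers in arbitrary units, purely to show the shape of the comparison:

```c
#include <stdio.h>

int main(void)
{
    const int data_array  = 10;  /* assumed data-array access time */
    const int tag_array   = 6;   /* smaller array, faster access */
    const int tag_compare = 2;
    const int mux         = 1;   /* way-select multiplexor pass-through */

    int way_ready  = tag_array + tag_compare;             /* 8: ready before the data */
    int set_assoc  = (way_ready > data_array ? way_ready  /* whichever finishes last */
                                             : data_array) + mux;
    int direct_map = data_array;                          /* no way selection needed */

    /* With these numbers: 10 vs 11, i.e., only the mux pass-through is added. */
    printf("direct-mapped: %d, set-associative: %d\n", direct_map, set_assoc);
    return 0;
}
```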

  • How did you calculate that one-in-eight chance? Which recent Intel x86 processors are also known to use µtag-based way predction? I've not seen an indication of that. – Hadi Brais Nov 30 '20 at 20:14
  • Willamette doesn't have staggered AGUs, only staggered ALUs, that's why in my edit I added "in Northwood+." – Hadi Brais Dec 10 '20 at 18:34
  • @HadiBrais I vaguely recall reading about Intel using way prediction on the Real World Technologies Forum, but I did not find anything with a Google search. If I cannot find a confirmation, I will probably just delete that part (though I think such is a "standard" technique). Sadly, microarchitectural details are often treated as trade secrets. –  Dec 10 '20 at 18:40
  • @HadiBrais Ugh. I will have do some research and then edit. (Using Internet access from a (closed) local library's garage does not make this convenient, so it may be a while before I make corrections.) If Willamette did not use a staggered AGU, perhaps it used early available bits; just one more thing to research. Thank you for the typo edits and the fact checks. –  Dec 10 '20 at 18:49