
It is quite obvious that the cache miss rate can be determined by the following formula:

miss_rate = n_misses / n_accesses
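
For example, 250 misses out of 1,000 accesses would give miss_rate = 250 / 1000 = 0.25, i.e. 25%.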

I have a doubt about how the number of misses is counted.

I think a cache miss (e.g. an L1 miss) is handled like this:

  1. miss in L1 [n_misses += 1]
  2. lookup L2 cache
  3. write data in L1
  4. check L1 again [n_hits += 1]

If this is correct, after a miss there will always be a hit. For this reason, the number of hits will always be at least equal to the number of misses. A consequence is that the miss rate can never be higher than 50%.
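
To make that model concrete, here is a minimal toy sketch of the bookkeeping I have in mind (my own illustration with an invented 4-line direct-mapped cache, not how real hardware or perf counts anything):

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy direct-mapped "L1" with 4 one-word lines, just to illustrate
     * the counting scheme described in the steps above. */
    #define NLINES 4
    static int  tags[NLINES];
    static bool valid[NLINES];
    static long n_hits = 0, n_misses = 0;

    static void access_word(int addr) {
        int idx = addr % NLINES;
        if (valid[idx] && tags[idx] == addr) {   /* step 1: L1 lookup hits */
            n_hits++;
        } else {
            n_misses++;                          /* step 1: L1 lookup misses */
            valid[idx] = true;                   /* steps 2-3: fetch the data, write it into L1 */
            tags[idx] = addr;
            n_hits++;                            /* step 4: the retried lookup now hits */
        }
    }

    int main(void) {
        for (int i = 0; i < 1000; i++)
            access_word(i);                      /* every address is new: all 1000 lookups miss */
        printf("misses=%ld hits=%ld miss rate=%.0f%%\n", n_misses, n_hits,
               100.0 * n_misses / (n_misses + n_hits));
        /* Counting the retry as a hit makes n_hits == n_misses here, so the
         * reported miss rate is 50% even though the program never re-used data. */
        return 0;
    }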

However, when measuring the cache miss rate with, for example, perf, there are several situations in which the miss rate is higher than 50%.
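
As a concrete example of such a situation (a minimal sketch of the kind of workload I mean, with names and sizes chosen just for illustration), loading from pseudo-random cache lines of a buffer much larger than L1d should miss on nearly every access; I would measure it with something like perf stat -e L1-dcache-loads,L1-dcache-load-misses ./a.out:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Load one byte from a pseudo-random 64-byte line of a 64 MiB buffer.
     * The buffer is far larger than L1d and the random access pattern also
     * defeats hardware prefetching, so nearly every load should miss in L1d. */
    int main(void) {
        size_t nlines = 1UL << 20;                 /* 2^20 lines * 64 B = 64 MiB */
        char *buf = malloc(nlines * 64);
        if (!buf) return 1;
        memset(buf, 1, nlines * 64);               /* touch the pages up front */

        unsigned long x = 1, sum = 0;
        for (long i = 0; i < 10 * 1000 * 1000; i++) {
            x = x * 6364136223846793005UL + 1442695040888963407UL;    /* LCG */
            sum += buf[(x >> 16) % nlines * 64];   /* one byte from a random line */
        }
        printf("%lu\n", sum);                      /* keep the loads live */
        return 0;
    }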

What am I missing?

  • It will do things in parallel for performance reasons (so it won't wait until the data is in the L1 cache to try again). – Erik Eidt Apr 26 '23 at 18:14
  • But even if it didn't, that fake hit #4 would not count as a "hit" by any definition I can think of; it isn't a 2nd access *by the program* and so cannot count as one, even if micro-architecturally it constitutes another access. That single access by the program was a miss. These stats have to apply to the program behavior, namely its access pattern (regardless of processor), rather than micro-architectural goings-on. – Erik Eidt Apr 26 '23 at 18:14

1 Answer


That's not how CPUs work; a load buffer waiting for an incoming line will get the bytes it needs directly, e.g. on Intel CPUs from the LFB (Line Fill Buffer) waiting for that cache line. LFBs are the buffers that track incoming and outgoing cache lines, allocated on a cache miss.

The load producing a value happens in parallel with populating the line in L1d. It doesn't have to replay¹ the load uop to have the load execution unit redo the TLB lookup and check permissions again. That strategy would cost extra latency for L2 hits, as well as some load throughput when you had a mix of loads that hit in L1d vs. L2.

On CPUs where the path between L2 and L1d isn't as wide as a full cache line (e.g. Intel before Skylake, and many earlier CPUs where the data paths might only be 32 bits wide), "early restart" and "critical word first" are also common techniques to reduce L1d miss latency a little: Early Restart lets the load produce a value when the necessary byte(s) arrive, not waiting for the entire cache line to arrive (or be written to L1d). Critical Word First lets the full-line transfer start with the chunk containing the byte(s) wanted by the demand load that missed. (If later loads miss in the same cache line, presumably the first one already initiated the request and its word will arrive first.)
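
As an illustration of that ordering (a sketch with an assumed 64-byte line and 16-byte transfer chunks, not any specific CPU's bus protocol), the chunks of the line arrive starting at the one containing the demanded byte and then wrap around:

    #include <stdio.h>

    #define LINE_SIZE  64
    #define CHUNK_SIZE 16
    #define NCHUNKS    (LINE_SIZE / CHUNK_SIZE)

    int main(void) {
        unsigned demand_offset = 40;                  /* byte the missing load wants */
        unsigned first = demand_offset / CHUNK_SIZE;  /* chunk holding that byte */
        for (unsigned i = 0; i < NCHUNKS; i++) {
            unsigned chunk = (first + i) % NCHUNKS;   /* critical word first, then wrap around */
            printf("transfer bytes %2u..%2u%s\n",
                   chunk * CHUNK_SIZE, chunk * CHUNK_SIZE + CHUNK_SIZE - 1,
                   i == 0 ? "   <- early restart: the load can produce its value now" : "");
        }
        return 0;
    }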

Even if you did hypothetically design a simplistic CPU that worked the way you were imagining, you'd probably design its performance counters not to count the load replay as another load that hits in cache, because that's not what people want to know about; it's just a (bad) implementation technique for getting incoming data into the execution pipeline. That's ignoring, of course, that a simplistic CPU probably wouldn't be designed with performance counters in the first place, so maybe you'd be talking about cycle-accurate simulation of it.
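
As a sketch of that counter-design choice (again just an illustration, not any real PMU or simulator), you'd attribute exactly one hit-or-miss event to each demand access the program makes, and never let the internal fill/retry touch the counters:

    #include <stdio.h>
    #include <stdbool.h>

    #define NLINES 4
    static int  tags[NLINES];
    static bool valid[NLINES];
    static long n_hits = 0, n_misses = 0;

    static void demand_access(int addr) {
        int idx = addr % NLINES;
        if (valid[idx] && tags[idx] == addr) {
            n_hits++;                   /* the program's access hit */
        } else {
            n_misses++;                 /* the program's access missed */
            valid[idx] = true;          /* fill the line... */
            tags[idx] = addr;           /* ...but don't count the retry as a hit */
        }
    }

    int main(void) {
        for (int i = 0; i < 1000; i++)
            demand_access(i);           /* no reuse at all */
        printf("miss rate = %.0f%%\n", 100.0 * n_misses / (n_hits + n_misses));
        /* prints 100%: per-access counting isn't capped at 50% */
        return 0;
    }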


Footnote 1: On Intel CPUs, when loads take extra time due to cache misses or other effects, uops waiting for the load data might get replayed. Cache-line-split loads or especially 4k-split loads might even need another cycle on the load port (e.g. to do another TLB lookup for the other half which is potentially coming from another page), but the load uop itself doesn't actually re-dispatch from the RS (scheduler). See also Are load ops deallocated from the RS when they dispatch, complete or some other time?.

– Peter Cordes