Loads can hit in L1d regardless of LFB occupancy; no interaction.
I'd be shocked if load execution units couldn't hit in L1d while all LFBs were occupied, even waiting for other loads. A load has to snoop LFBs (to see recent NT stores, or regular stores that have committed to an LFB while waiting for an RFO under the limited conditions where that can happen without violating the memory-ordering rules), but a load doesn't need to allocate one on an L1d hit.
It's always fun to try to test an assumption, though, and in fact a good idea to check theory against experiment. In this case, I think I've found fairly good evidence that L1d load hits can still happen while all LFBs are full (with partial-line NT stores).
Sorry this answer is a bit rambling, with the code at the bottom. I played around with a bunch of iteration counts before choosing which one to actually show, and didn't go back to tidy up the wording. So you can maybe get some experience of the process of thinking up an experiment and refining it, if there's any upside to this. :P
L1d load hits have very high throughput (2/clock from SnB onward, 3/clock in Alder Lake), but it would be hard to distinguish a bottleneck on that from a bottleneck on whatever runs out of LFBs. Perhaps looking at the latency of L1d loads in a pointer-chasing scenario like mov rax, [rax] could more easily detect lost progress while staying far from other throughput limits? (It also makes the limited RS / ROB size "last longer" in terms of cycles, leaving room to sneak some stores in.)
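A minimal sketch of that kind of latency test (essentially what the full source at the bottom of this answer does, minus the imul; the register choices and iteration count here are just placeholders):

mov rax, rsp            ; any pointer to private memory we own
mov [rax], rax          ; make it point to itself
mov ecx, 100000000
.chase:
mov rax, [rax]          ; L1d load-use latency (~4-5 cycles) is the only bottleneck
dec ecx
jnz .chase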
Or maybe we should avoid trying to keep all LFBs occupied as a steady-state condition, because balancing that against a load-latency bottleneck would be hard to distinguish from the stores on their own just becoming a real throughput bottleneck.
Instead do a burst of NT stores occasionally, or something else that will occupy all 10, 12, or however many LFBs you have in whatever generation of Intel CPU. With the store buffer as well to absorb that burst of NT stores, we can fill all the LFBs some of the time, without expecting to create an overall throughput bottleneck or any bubbles in the latency dep-chain. So the CPU can absorb the burst and have the front-end get back to issuing uops from our dep chain.
NT stores are a good choice: they occupy an LFB from when they commit until the line is handed off, and partial-line NT stores that we never complete will sit in an LFB until it's evicted to make room for more. (When NT stores do write all the bytes of a cache line, the LFB flushes itself; that's the normal use-case.)
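A sketch of the difference (assuming rdi and rsi each point at the start of a different 64-byte-aligned line, and ymm0 holds anything):

vmovntdq [rdi], ymm0        ; first 32 bytes of a line: allocates an LFB, line still partial
vmovntdq [rdi+32], ymm0     ; last 32 bytes: the line is now complete, so the LFB flushes itself
vmovntdq [rsi], ymm0        ; only 32 of 64 bytes of another line: that LFB stays occupied
                            ; until it's evicted to make room for later NT stores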
perf stat measures the whole program, but with no other code running in user-space, startup overhead is minimal: only a couple of page faults. Letting it run for a good while, close to a second, means the few milliseconds for clock speed to jump to full are negligible.
On i7-6700k Skylake (with DDR4-2666 memory) on Arch GNU/Linux, with energy_performance_preference = balance_performance, it only runs at 3.9GHz for sustained periods, not 4.2GHz, which keeps the fans near-silent. But it ramps to that speed very quickly and can maintain it on all cores indefinitely, so interrupts on other cores and so on don't disturb things.
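(For reference, with the intel_pstate driver that setting lives in sysfs; this is just how I'd check or set it on a setup like this, not part of the test itself:)

$ cat /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference
balance_performance
$ echo balance_performance | sudo tee /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference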
Tested with partial-line NT stores to 32 contiguous lines, as a burst of store activity in between runs of the mov/imul latency dep chain (100 inner iterations x 8 unrolled mov/imul reps, i.e. 100 x 17 uops of that loop). See the source at the bottom of this answer. I later ended up going with a somewhat shorter dep chain, so bursts of store activity could overlap with it for more of the total run time without being so long that it stalls. So if they were going to have an effect, it would be more visible.
$ asm-link -dn "$t".asm -DNTSTORE_ITERS=18
+ nasm -felf64 -Worphan-labels lfb-test.asm -DNTSTORE_ITERS=18
+ ld -o lfb-test lfb-test.o
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,exe_activity.bound_on_stores,resource_stalls.sb,ld_blocks_partial.address_alias,ld_blocks.store_forward -r3 ./"$t"
Performance counter stats for './lfb-test' (3 runs):
1,647.24 msec task-clock # 1.000 CPUs utilized ( +- 0.02% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
2 page-faults # 1.214 /sec
6,421,016,156 cycles # 3.897 GHz ( +- 0.00% )
1,895,000,506 instructions # 0.30 insn per cycle ( +- 0.00% )
113,936 exe_activity.bound_on_stores # 69.158 K/sec ( +- 50.67% )
163,512 resource_stalls.sb # 99.250 K/sec ( +- 44.22% )
0 ld_blocks_partial.address_alias # 0.000 /sec
0 ld_blocks.store_forward # 0.000 /sec
1.647758 +- 0.000279 seconds time elapsed ( +- 0.02% )
So 6421M cycles instead of the expected 6400M (1M outer iterations x 100 inner x 8 mov/imul pairs x 8 cycles each, 5c load + 3c imul) means we're just barely getting to the point where OoO exec starts to lose a bit of progress on the load/imul dep chain, maybe due to limited RS (scheduler) size. (See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for analysis of this kind of impact on a long dep chain.)
The 0 ld_blocks counts show that I successfully avoided 4k aliasing with the way I chose addresses for the pointer-chasing mov rax,[rax] vs. the buffer (the load chain chases a page-aligned stack address, while the stores start at buf+64).
Store part only
We can test the stores alone to make sure they'd take a non-negligible fraction of the total time if there were no overlap. We want to verify that the store part of the workload isn't something like 100x faster than the ALU dep chain, in which case its effect might be lost in the noise even if it were stalling the latency dep chain.
I edited the load/imul chain to use mov ecx,1 and %rep 0, so the inner loop is just one not-taken dec/jnz.
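Roughly what the edited inner loop looks like (a sketch; in the actual source these counts are hard-coded, not build-time macros):

mov ecx, 1              ; was 10 (or 100 in the earlier test)
.inner:
%rep 0                  ; was 20: emit no mov/imul pairs at all
    mov rax, [rax]
    imul rax, rax, 1
%endrep
dec ecx
jnz .inner              ; one not-taken dec/jnz, then on to the store burst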
# no latency dep-chain
# NTSTORE_ITERS=16 (32 NT stores to 32 cache lines)
$ t=lfb-test; asm-link -dn "$t".asm -DNTSTORE_ITERS=16 && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,exe_activity.bound_on_stores,resource_stalls.sb,br_misp_retired.all_branches_pebs,int_misc.recovery_cycles_any -r3 ./"$t"
411.00 msec task-clock # 0.999 CPUs utilized ( +- 0.06% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
2 page-faults # 4.863 /sec
1,601,892,487 cycles # 3.895 GHz ( +- 0.00% )
87,000,133 instructions # 0.05 insn per cycle ( +- 0.00% )
1,567,641,964 exe_activity.bound_on_stores # 3.812 G/sec ( +- 0.01% )
1,567,641,964 resource_stalls.sb # 3.812 G/sec ( +- 0.01% )
405 br_misp_retired.all_branches_pebs # 984.826 /sec ( +- 10.91% )
16,606 int_misc.recovery_cycles_any # 40.380 K/sec ( +- 8.02% )
0.411499 +- 0.000250 seconds time elapsed ( +- 0.06% )
Total cycles scales linearly with -DNTSTORE_ITERS=n from about 9 upward, with exe_activity.bound_on_stores and resource_stalls.sb essentially equal to cycles.
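One way to sweep that, wrapping the same build-and-run commands in a shell loop (this exact helper loop is mine, not from the original runs):

$ for n in 6 7 8 9 10 12 16 20; do
>   nasm -felf64 -Worphan-labels lfb-test.asm -DNTSTORE_ITERS=$n && ld -o lfb-test lfb-test.o &&
>   taskset -c 3 perf stat --all-user -e cycles,exe_activity.bound_on_stores,resource_stalls.sb -r3 ./lfb-test
> done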
The last two counters are measuring branch misses, and total front-end cycles lost to re-steer and other recovery from things like branch misses.
Branch misses are usually negligible at 19 inner loop iterations or lower, but at 21 or higher we get almost exactly one mispredict per outer-loop iteration, i.e. the last inner iteration every time.
For NTSTORE_ITERS=6 or lower, it's vastly faster (14M cycles for 1M outer iterations = 12M NT stores), which makes sense because Skylake has 12 LFBs. Each burst's NT stores hit in the same still-open partial-line LFBs (every outer iteration rewrites the same 12 lines), not needing to evict anything, so there's no off-core bottleneck. n=7 (14 lines) takes ~390M cycles, n=8 (16 lines) takes ~600M +- 30M cycles. For n=10 (20 lines) we get 990M cycles.
This extremely fast speed with n=6 holds up when the load dep chain is running, e.g. a latency chain of ecx=1 / %rep 2 with store work of n=6, and the outer iteration count bumped up by a factor of 100. Total time = 1600M cycles, 2.56 IPC, vs. 1400M cycles with the dep chain shortened even more, just bound on store throughput. I think if loads were disturbing LFBs at all, that would make it much slower. I don't know why it takes 14 cycles for 12 NT stores.
# dep chain: ECX=1 / %rep 2
# stores: NTSTORE_ITERS=6 (12 lines, the same as the number of LFBs)
# outer iterations: 100M instead of the 1M in other tests.
Performance counter stats for './lfb-test' (3 runs):
410.56 msec task-clock # 0.999 CPUs utilized ( +- 0.06% )
2 page-faults # 4.868 /sec
1,600,236,855 cycles # 3.895 GHz ( +- 0.00% )
4,100,000,135 instructions # 2.56 insn per cycle ( +- 0.00% )
92,188 exe_activity.bound_on_stores # 224.404 K/sec ( +- 54.94% )
675,039,043 resource_stalls.sb # 1.643 G/sec ( +- 0.01% )
So to occupy all the LFBs for most of the time, we should be using at least 20 cache lines; might as well go for 32 (n=16). That's short enough not to cause branch misses, or to fill up the store buffer or scheduler if we give it time to drain in between, but long enough to be way more than the number of LFBs, so we certainly have lots of cycles where they're all occupied.
IDK if it's just a coincidence of core clock and memory clock, but that n=16 (32 NT stores) case takes almost exactly half the time of the load / ALU dep chain I created. With 1M outer iterations doing 32 NT stores each, that's about 1602 cycles per 32 NT stores, or 50 cycles per partial-line NT store in terms of throughput cost. They execute on the execution units at 1/clock, so a burst of 32 of them can get into the store buffer really quickly compared to how long it takes one to commit.
(Of course, there are buffers at other levels of the cache hierarchy, like the "superqueue" between L2 and the ring bus. So when NT stores come in bursts, the first of them can likely hand off faster than that. Except an LFB won't even try until it's getting evicted as a partial-line write.)
Anyway, n=16 for 32 cache lines touched takes half the time of the ALU dep chain, when doing just the stores. And it's bursty enough that it's almost certainly occupying all the LFBs for a decent fraction of that 50% "duty cycle" of store bursts.
Certainly they'd all be occupied for a much larger fraction of the time than the couple-percent slowdown we see when doing this in parallel with the load/imul chain. That dep chain needs to complete a load every 8 cycles, and can't "catch up" in bursts: any time a load address is ready but the load doesn't execute that cycle, the throughput is lost and can't be made up, because that's how critical-path latency bottlenecks work.
Unless the CPU reserves an LFB for loads, if they somehow need one. I think that's unlikely.
Reducing the ALU dep chain so it also takes about 1.6G cycles on its own, the same as the store-throughput bottleneck with n=16, the two still overlap near-perfectly when combined. Maintaining that store throughput presumably needs all the LFBs, so this is pretty solid evidence that L1d load hits and LFB usage are independent.
Matched bottlenecks: Latency and store-throughput overlap near perfectly
# dep chain iters = 10 x %rep 20 - alone takes 1.6G cycles
# NTSTORE_ITERS=16 - alone takes 1.602G cycles
# together taking 1.621G cycles
$ t=lfb-test; asm-link -dn "$t".asm -DNTSTORE_ITERS=16 && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,exe_activity.bound_on_stores,resource_stalls.sb,br_misp_retired.all_branches_pebs,int_misc.recovery_cycles_any -r3 ./"$t"
416.10 msec task-clock # 0.997 CPUs utilized ( +- 0.15% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
2 page-faults # 4.797 /sec
1,621,634,284 cycles # 3.890 GHz ( +- 0.02% )
505,000,135 instructions # 0.31 insn per cycle ( +- 0.00% )
575,600 exe_activity.bound_on_stores # 1.381 M/sec ( +- 75.50% )
1,298,930 resource_stalls.sb # 3.116 M/sec ( +- 47.96% )
1,376 br_misp_retired.all_branches_pebs # 3.301 K/sec ( +-113.51% )
94,101 int_misc.recovery_cycles_any # 225.719 K/sec ( +-256.14% )
0.417209 +- 0.000704 seconds time elapsed ( +- 0.17% )
With the inner iterations twice as long, so each part takes about 3200M cycles on its own (just load/imul or just stores), NTSTORE_ITERS=29 is fine, still 3289M cycles. And n=31 gives 3565M cycles. But bumping up to n=32 (64 cache lines) makes performance fall off a cliff: 4920M cycles. I don't know what causes this; maybe some kind of ROB-size or store-buffer limit? exe_activity.bound_on_stores and resource_stalls.sb didn't go up dramatically.
NASM source for Linux static executable
Build with nasm -felf64 lfb-test.asm -DNTSTORE_ITERS=16 && ld -o lfb-test lfb-test.o
The iteration-count constants in this are what I used for the final test that showed near-perfect overlap, with the dep chain and store throughput each taking 1600 cycles per outer iteration. Earlier perf experiments were from versions with %rep 40 for the dep chain, or with mov ecx,100 and %rep 8 for the first perf output with 6,421,016,156 cycles.
global _start
_start:
and rsp, -4096 ; help avoid 4k aliasing between load chain and stores
mov rax, rsp ; do our pointer chasing far from the buffer, overwriting argc
mov [rax], rax
vpaddd ymm0, ymm1, ymm2 ; sometimes unwritten registers can be weird
mov ebp, 1000000 ; outer repeat count
.loop:
mov ecx, 10 ; low iter count to avoid a mispredict
.inner:
%rep 20 ; unroll 20x (5+3 cycles) = 160 cycle dep chain
mov rax, [rax]
imul rax, rax, 1 ; lengthen the dep chain without memory access. And defeat the special case load latency thing in some Intel CPUs so it's always 5 cycles
%endrep
dec ecx
jnz .inner
%ifndef NTSTORE_ITERS
%define NTSTORE_ITERS 16
%endif
mov ecx, NTSTORE_ITERS
lea rdi, [rel buf+64] ; start at 2nd cache line of the page to avoid 4k aliasing unless we go really far
.store_burst: ; NTSTORE_ITERS x 2 cache lines of partial-line NT stores
vmovntdq [rdi+ 0], ymm0
;vmovntdq [rdi+32], ymm0 ; partial line NT stores take much longer to release their LFB, so we get more consumption for fewer uops
vmovntdq [rdi+64], ymm0
;vmovntdq [rdi+96], ymm0
add rdi, 128
dec rcx
jnz .store_burst
dec ebp
jnz .loop
mov eax, 231 ; Linux __NR_exit_group
xor edi, edi
syscall ; _exit(0)
section .bss
align 4096
buf: resb 128 * 4096
I probably didn't need to use AVX2; a legacy-SSE movntdq [rdi+64], xmm0 would have worked just as well, writing the first 16 instead of 32 bytes of a cache line.
Useful perf counter events (descriptions from perf list)
exe_activity.bound_on_stores - [Cycles where the Store Buffer was full and no outstanding load].
If the CPU catches up on the load chain while the store buffer is full, we'll get counts for this. (If there's room for the front-end to issue more loads/imuls after getting back to that part of the loop, those outstanding loads stop it from counting.)
resource_stalls.sb - [Cycles stalled due to no store buffers available. (not including draining form sync)]
I think this counts when the front-end can't alloc/rename a store because there aren't any store buffer entries left. (Yes, those are allocated during issue/rename, not when the store executes. I think that implies even a misaligned store only uses one store buffer entry, with the extra handling happening during the TLB check and when committing to cache.)
ld_blocks_partial.address_alias - [False dependencies in MOB due to partial compare on address]
This is the 4k aliasing that I wanted to avoid as a confounding factor.
br_misp_retired.all_branches - [All mispredicted macro branch instructions retired]
Counts how many branch instructions were mispredicted.
int_misc.recovery_cycles_any - [Core cycles the allocator was stalled due to recovery from earlier clear event for any thread running on the physical core (e.g. misprediction or memory nuke)]
Counts the front-end penalty for branch misses (and any other pipeline clears); as long as it's low, it's not the reason for anything running slow.