Loads can hit in L1d regardless of LFB occupancy; no interaction.
I'd be shocked if load execution units couldn't hit in L1d while all LFBs were occupied, even waiting for other loads. A load has to snoop LFBs (to see recent NT stores, or regular stores that have committed to an LFB while waiting for an RFO under the limited conditions where that can happen without violating the memory-ordering rules), but a load doesn't need to allocate one on an L1d hit.
It's always fun to try to test an assumption, though, and in fact a good idea to check theory against experiment. In this case, I think I've found fairly good evidence that L1d load hits can still happen while all LFBs are full (with partial-line NT stores).
Sorry this answer is a bit rambling, with the code at the bottom. I played around with a bunch of iteration counts before choosing which one to actually show, and didn't go back to tidy up the wording. So you can maybe get some experience of the process of thinking up an experiment and refining it, if there's any upside to this. :P
L1d load hits have very high throughput (2/clock from SnB onward, 3/clock in Alder Lake), but it would be hard to distinguish a bottleneck on that from a bottleneck on whatever runs out of LFBs. Perhaps looking at the latency of L1d loads in a pointer-chasing scenario like mov rax, [rax] could more easily detect lost progress while staying far from other throughput limits? (It also makes the limited RS / ROB size "last longer" in terms of cycles, leaving room to sneak some stores in.)
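A minimal sketch of that kind of latency test (essentially what the full source at the bottom of this answer does, minus the imul; the register choices and iteration count here are just placeholders):

mov rax, rsp            ; any pointer to private memory we own
mov [rax], rax          ; make it point to itself
mov ecx, 100000000
.chase:
mov rax, [rax]          ; L1d load-use latency (~4-5 cycles) is the only bottleneck
dec ecx
jnz .chase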
Or maybe we should avoid trying to keep all LFBs occupied as a steady-state condition, because balancing that against a load-latency bottleneck would be hard to distinguish from the stores on their own just becoming a real throughput bottleneck.
Instead do a burst of NT stores occasionally, or something else that will occupy all 10, 12, or however many LFBs you have in whatever generation of Intel CPU. With the store buffer as well to absorb that burst of NT stores, we can fill all the LFBs some of the time, without expecting to create an overall throughput bottleneck or any bubbles in the latency dep-chain. So the CPU can absorb the burst and have the front-end get back to issuing uops from our dep chain.
NT stores are a good choice: they occupy an LFB from when they commit until the line is handed off, and partial-line NT stores that we never complete will sit in an LFB until it's evicted to make room for more. (When NT stores do write all the bytes of a cache line, the LFB flushes itself; that's the normal use-case.)
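A sketch of the difference (assuming rdi and rsi each point at the start of a different 64-byte-aligned line, and ymm0 holds anything):

vmovntdq [rdi], ymm0        ; first 32 bytes of a line: allocates an LFB, line still partial
vmovntdq [rdi+32], ymm0     ; last 32 bytes: the line is now complete, so the LFB flushes itself
vmovntdq [rsi], ymm0        ; only 32 of 64 bytes of another line: that LFB stays occupied
                            ; until it's evicted to make room for later NT stores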
perf stat measures the whole program, but with no other code running in user-space, startup overhead is minimal: only a couple of page faults. Letting it run for a good while, close to a second, means the few milliseconds for clock speed to jump to full are negligible.
On i7-6700k Skylake (with DDR4-2666 memory) on Arch GNU/Linux, with energy_performance_preference = balance_performance, it only runs at 3.9GHz for sustained periods, not 4.2GHz, which keeps the fans near-silent. But it ramps to that speed very quickly and can maintain it on all cores indefinitely, so interrupts on other cores and so on don't disturb things.
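(For reference, with the intel_pstate driver that setting lives in sysfs; this is just how I'd check or set it on a setup like this, not part of the test itself:)

$ cat /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference
balance_performance
$ echo balance_performance | sudo tee /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference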
Tested with partial-line NT stores to 32 contiguous lines, as a burst of store activity in between runs of the mov/imul latency dep chain (100 inner iterations x 8 unrolled mov/imul reps, i.e. 100 x 17 uops of that loop). See the source at the bottom of this answer. I later ended up going with a somewhat shorter dep chain, so bursts of store activity could overlap with it for more of the total run time without being so long that it stalls. So if they were going to have an effect, it would be more visible.
$ asm-link -dn "$t".asm -DNTSTORE_ITERS=18
+ nasm -felf64 -Worphan-labels lfb-test.asm -DNTSTORE_ITERS=18
+ ld -o lfb-test lfb-test.o
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,exe_activity.bound_on_stores,resource_stalls.sb,ld_blocks_partial.address_alias,ld_blocks.store_forward -r3 ./"$t"
Performance counter stats for './lfb-test' (3 runs):
1,647.24 msec task-clock # 1.000 CPUs utilized ( +- 0.02% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
2 page-faults # 1.214 /sec
6,421,016,156 cycles # 3.897 GHz ( +- 0.00% )
1,895,000,506 instructions # 0.30 insn per cycle ( +- 0.00% )
113,936 exe_activity.bound_on_stores # 69.158 K/sec ( +- 50.67% )
163,512 resource_stalls.sb # 99.250 K/sec ( +- 44.22% )
0 ld_blocks_partial.address_alias # 0.000 /sec
0 ld_blocks.store_forward # 0.000 /sec
1.647758 +- 0.000279 seconds time elapsed ( +- 0.02% )
So 6421M cycles instead of the expected 6400M (1M outer iterations x 100 inner x 8 mov/imul pairs x 8 cycles each, 5c load + 3c imul) means we're just barely getting to the point where OoO exec starts to lose a bit of progress on the load/imul dep chain, maybe due to limited RS (scheduler) size. (See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for analysis of this kind of impact on a long dep chain.)
The 0 ld_blocks counts show that I successfully avoided 4k aliasing with the way I chose addresses for the pointer-chasing mov rax,[rax] vs. the buffer (the load chain chases a page-aligned stack address, while the stores start at buf+64).
Store part only
We can test the stores alone to make sure they'd take a non-negligible fraction of the total time if there were no overlap. We want to verify that the store part of the workload isn't something like 100x faster than the ALU dep chain, in which case its effect might be lost in the noise even if it were stalling the latency dep chain.
I edited the load/imul chain to use mov ecx,1 and %rep 0, so the inner loop is just one not-taken dec/jnz.
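Roughly what the edited inner loop looks like (a sketch; in the actual source these counts are hard-coded, not build-time macros):

mov ecx, 1              ; was 10 (or 100 in the earlier test)
.inner:
%rep 0                  ; was 20: emit no mov/imul pairs at all
    mov rax, [rax]
    imul rax, rax, 1
%endrep
dec ecx
jnz .inner              ; one not-taken dec/jnz, then on to the store burst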
# no latency dep-chain
# NTSTORE_ITERS=16 (32 NT stores to 32 cache lines)
$ t=lfb-test; asm-link -dn "$t".asm -DNTSTORE_ITERS=16 && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,exe_activity.bound_on_stores,resource_stalls.sb,br_misp_retired.all_branches_pebs,int_misc.recovery_cycles_any -r3 ./"$t"
411.00 msec task-clock # 0.999 CPUs utilized ( +- 0.06% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
2 page-faults # 4.863 /sec
1,601,892,487 cycles # 3.895 GHz ( +- 0.00% )
87,000,133 instructions # 0.05 insn per cycle ( +- 0.00% )
1,567,641,964 exe_activity.bound_on_stores # 3.812 G/sec ( +- 0.01% )
1,567,641,964 resource_stalls.sb # 3.812 G/sec ( +- 0.01% )
405 br_misp_retired.all_branches_pebs # 984.826 /sec ( +- 10.91% )
16,606 int_misc.recovery_cycles_any # 40.380 K/sec ( +- 8.02% )
0.411499 +- 0.000250 seconds time elapsed ( +- 0.06% )
Total cycles scales linearly with -DNTSTORE_ITERS=n from about 9 upward, with exe_activity.bound_on_stores and resource_stalls.sb essentially equal to cycles.
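One way to sweep that, wrapping the same build-and-run commands in a shell loop (this exact helper loop is mine, not from the original runs):

$ for n in 6 7 8 9 10 12 16 20; do
>   nasm -felf64 -Worphan-labels lfb-test.asm -DNTSTORE_ITERS=$n && ld -o lfb-test lfb-test.o &&
>   taskset -c 3 perf stat --all-user -e cycles,exe_activity.bound_on_stores,resource_stalls.sb -r3 ./lfb-test
> done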
The last two counters are measuring branch misses, and total front-end cycles lost to re-steer and other recovery from things like branch misses.
Branch misses are usually negligible at 19 inner loop iterations or lower, but at 21 or higher we get almost exactly one mispredict per outer-loop iteration, i.e. the last inner iteration every time.
For NTSTORE_ITERS=6 or lower, it's vastly faster (14M cycles for 1M outer iterations = 12M NT stores), which makes sense because Skylake has 12 LFBs. Each burst's NT stores hit in the same still-open partial-line LFBs (every outer iteration rewrites the same 12 lines), not needing to evict anything, so there's no off-core bottleneck. n=7 (14 lines) takes ~390M cycles, n=8 (16 lines) takes ~600M +- 30M cycles. For n=10 (20 lines) we get 990M cycles.
This extremely fast speed with n=6 holds up when the load dep chain is running, e.g. a latency chain of ecx=1 / %rep 2 with store work of n=6, and the outer iteration count bumped up by a factor of 100. Total time = 1600M cycles, 2.56 IPC, vs. 1400M cycles with the dep chain shortened even more, just bound on store throughput. I think if loads were disturbing LFBs at all, that would make it much slower. I don't know why it takes 14 cycles for 12 NT stores.
# dep chain: ECX=1 / %rep 2
# stores: NTSTORE_ITERS=6 (12 lines, the same as the number of LFBs)
# outer iterations: 100M instead of the 1M in other tests.
Performance counter stats for './lfb-test' (3 runs):
410.56 msec task-clock # 0.999 CPUs utilized ( +- 0.06% )
2 page-faults # 4.868 /sec
1,600,236,855 cycles # 3.895 GHz ( +- 0.00% )
4,100,000,135 instructions # 2.56 insn per cycle ( +- 0.00% )
92,188 exe_activity.bound_on_stores # 224.404 K/sec ( +- 54.94% )
675,039,043 resource_stalls.sb # 1.643 G/sec ( +- 0.01% )
So to occupy all the LFBs for most of the time, we should be using at least 20 cache lines; might as well go for 32 (n=16). That's short enough not to cause branch misses, or to fill up the store buffer or scheduler if we give it time to drain in between, but long enough to be way more than the number of LFBs, so we certainly have lots of cycles where they're all occupied.
IDK if it's just a coincidence of core clock and memory clock, but that n=16 (32 NT stores) case takes almost exactly half the time of the load / ALU dep chain I created. With 1M outer iterations doing 32 NT stores each, that's about 1602 cycles per 32 NT stores, or 50 cycles per partial-line NT store in terms of throughput cost. They execute on the execution units at 1/clock, so a burst of 32 of them can get into the store buffer really quickly compared to how long it takes one to commit.
(Of course, there are buffers at other levels of the cache hierarchy, like the "superqueue" between L2 and the ring bus. So when NT stores come in bursts, the first of them can likely hand off faster than that. Except an LFB won't even try until it's getting evicted as a partial-line write.)
Anyway, n=16 for 32 cache lines touched takes half the time of the ALU dep chain, when doing just the stores. And it's bursty enough that it's almost certainly occupying all the LFBs for a decent fraction of that 50% "duty cycle" of store bursts.
Certainly they'd all be occupied for a much larger fraction of the time than the couple-percent slowdown we see when doing this in parallel with the load/imul chain. That dep chain needs to complete a load every 8 cycles, and can't "catch up" in bursts: any time a load address is ready but the load doesn't execute that cycle, the throughput is lost and can't be made up, because that's how critical-path latency bottlenecks work.
Unless the CPU reserves an LFB for loads, if they somehow need one. I think that's unlikely.
Reducing the ALU dep chain so it also takes about 1.6G cycles on its own, the same as the store-throughput bottleneck with n=16, the two still overlap near-perfectly when combined. Maintaining that store throughput presumably needs all the LFBs, so this is pretty solid evidence that L1d load hits and LFB usage are independent.
Matched bottlenecks: Latency and store-throughput overlap near perfectly
# dep chain iters = 10 x %rep 20 - alone takes 1.6G cycles
# NTSTORE_ITERS=16 - alone takes 1.602G cycles
# together taking 1.621G cycles
$ t=lfb-test; asm-link -dn "$t".asm -DNTSTORE_ITERS=16 && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,exe_activity.bound_on_stores,resource_stalls.sb,br_misp_retired.all_branches_pebs,int_misc.recovery_cycles_any -r3 ./"$t"
416.10 msec task-clock # 0.997 CPUs utilized ( +- 0.15% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
2 page-faults # 4.797 /sec
1,621,634,284 cycles # 3.890 GHz ( +- 0.02% )
505,000,135 instructions # 0.31 insn per cycle ( +- 0.00% )
575,600 exe_activity.bound_on_stores # 1.381 M/sec ( +- 75.50% )
1,298,930 resource_stalls.sb # 3.116 M/sec ( +- 47.96% )
1,376 br_misp_retired.all_branches_pebs # 3.301 K/sec ( +-113.51% )
94,101 int_misc.recovery_cycles_any # 225.719 K/sec ( +-256.14% )
0.417209 +- 0.000704 seconds time elapsed ( +- 0.17% )
With the inner iterations twice as long, so each part takes about 3200M cycles on its own (just load/imul or just stores), NTSTORE_ITERS=29 is fine, still 3289M cycles. And n=31 gives 3565M cycles. But bumping up to n=32 (64 cache lines) makes performance fall off a cliff: 4920M cycles. I don't know what causes this; maybe some kind of ROB-size or store-buffer limit? exe_activity.bound_on_stores and resource_stalls.sb didn't go up dramatically.
NASM source for Linux static executable
Build with nasm -felf64 lfb-test.asm -DNTSTORE_ITERS=16 && ld -o lfb-test lfb-test.o
The iteration-count constants in this are what I used for the final test that showed near-perfect overlap, with the dep chain and store throughput each taking 1600 cycles per outer iteration. Earlier perf experiments were from versions with %rep 40 for the dep chain, or with mov ecx,100 and %rep 8 for the first perf output with 6,421,016,156 cycles.
global _start
_start:
and rsp, -4096 ; help avoid 4k aliasing between load chain and stores
mov rax, rsp ; do our pointer chasing far from the buffer, overwriting argc
mov [rax], rax
vpaddd ymm0, ymm1, ymm2 ; sometimes unwritten registers can be weird
mov ebp, 1000000 ; outer repeat count
.loop:
mov ecx, 10 ; low iter count to avoid a mispredict
.inner:
%rep 20 ; unroll 20x (5+3 cycles) = 160 cycle dep chain
mov rax, [rax]
imul rax, rax, 1 ; lengthen the dep chain without memory access. And defeat the special case load latency thing in some Intel CPUs so it's always 5 cycles
%endrep
dec ecx
jnz .inner
%ifndef NTSTORE_ITERS
%define NTSTORE_ITERS 16
%endif
mov ecx, NTSTORE_ITERS
lea rdi, [rel buf+64] ; start at 2nd cache line of the page to avoid 4k aliasing unless we go really far
.store_burst: ; NTSTORE_ITERS x 2 cache lines of partial-line NT stores
vmovntdq [rdi+ 0], ymm0
;vmovntdq [rdi+32], ymm0 ; partial line NT stores take much longer to release their LFB, so we get more consumption for fewer uops
vmovntdq [rdi+64], ymm0
;vmovntdq [rdi+96], ymm0
add rdi, 128
dec rcx
jnz .store_burst
dec ebp
jnz .loop
mov eax, 231 ; Linux __NR_exit_group
xor edi, edi
syscall ; _exit(0)
section .bss
align 4096
buf: resb 128 * 4096
I probably didn't need to use AVX2; a legacy-SSE movntdq [rdi+64], xmm0 would have worked just as well, writing the first 16 instead of 32 bytes of a cache line.
Useful perf counter events (descriptions from perf list)
exe_activity.bound_on_stores - [Cycles where the Store Buffer was full and no outstanding load].
If the CPU catches up on the load chain while the store buffer is full, we'll get counts for this. (If there's room for the front-end to issue more loads/imuls after getting back to that part of the loop, those outstanding loads stop it from counting.)
resource_stalls.sb - [Cycles stalled due to no store buffers available. (not including draining form sync)]
I think this counts when the front-end can't alloc/rename a store because there aren't any store buffer entries left. (Yes, those are allocated during issue/rename, not when the store executes. I think that implies even a misaligned store only uses one store buffer entry, with the extra handling happening during the TLB check and when committing to cache.)
ld_blocks_partial.address_alias - [False dependencies in MOB due to partial compare on address]
This is the 4k aliasing that I wanted to avoid as a confounding factor.
br_misp_retired.all_branches - [All mispredicted macro branch instructions retired]
Counts how many branch instructions were mispredicted.
int_misc.recovery_cycles_any - [Core cycles the allocator was stalled due to recovery from earlier clear event for any thread running on the physical core (e.g. misprediction or memory nuke)]
Counts the front-end penalty for branch misses (and any other pipeline clears); as long as it's low, it's not the reason for anything running slow.