You could just do a big memset; bottlenecked on DRAM bandwidth, it will spend a lot of time with the store buffer being full (and thus unable to issue new store instructions from the front-end, so it'll stall).
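For example, a minimal sketch of that memset approach (the buffer size and the empty-asm compiler barrier are my choices here, assuming GNU C; they're not from the question):

#include <string.h>
#include <stdlib.h>

#define BUF_SIZE (512UL*1024*1024)   // much bigger than L3, so the stores miss in cache

int main(void)
{
    char *buf = malloc(BUF_SIZE);
    if (!buf) return 1;
    while (1) {
        memset(buf, 'A', BUF_SIZE);
        // GNU C compiler barrier so the compiler can't notice the buffer
        // is never read and optimize the memset away.
        asm volatile("" :: "r"(buf) : "memory");
    }
    return 0;   // unreachable
}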
Or maybe dd if=/dev/zero of=/dev/null bs=128M. That large block size, bigger than L3 cache, gets dd to make system calls that do large copy_to/from_user inside the kernel, basically a big memcpy using rep movsb.
With perf stat (without --all-user, of course, since the work happens in the kernel), I get lots of SB stalls on my Skylake CPU.
taskset -c 1 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,resource_stalls.sb,idq.mite_uops -r1 dd if=/dev/zero of=/dev/null bs=128M
^C699+0 records in
699+0 records out
93818191872 bytes (94 GB, 87 GiB) copied, 3.77358 s, 24.9 GB/s
dd: Interrupt
Performance counter stats for 'dd if=/dev/zero of=/dev/null bs=128M':
3,774.08 msec task-clock # 1.000 CPUs utilized
14 context-switches # 3.710 /sec
0 cpu-migrations # 0.000 /sec
139 page-faults # 36.830 /sec
14,714,828,343 cycles # 3.899 GHz
838,605,753 instructions # 0.06 insn per cycle
7,562,186,851 uops_issued.any # 2.004 G/sec
10,094,269,956 uops_executed.thread # 2.675 G/sec
10,770,860,694 resource_stalls.sb # 2.854 G/sec
480,490,899 idq.mite_uops # 127.313 M/sec
3.774573314 seconds time elapsed
0.000000000 seconds user
3.763515000 seconds sys
Note that resource_stalls.sb is a large fraction of cycles, so the majority of cycles were stalled on the store buffer being full. The instructions count is extremely small because the Linux kernel uses rep movsb for copying memory, unlike in user-space where AVX2 is available. We can see from uops_issued.any and uops_executed.thread that the average uops per instruction is way above the usual 1 or 2 for typical non-microcoded instructions, and that the uop throughput per clock cycle is pretty reasonable. (idq.mite_uops counts legacy decode, to help detect accidental front-end stalls.)
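(Concretely, from the numbers above: 7,562M uops_issued.any / 839M instructions is about 9 uops issued per instruction, and about 12 executed; 10,094M uops_executed.thread / 14,715M cycles is roughly 0.69 uops executed per clock.)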
Another thing you can do is scattered stores over a medium-sized array: small enough to get no TLB misses, but large enough to mostly miss in L3 cache (or at least miss in L2).
Use a fast PRNG, like an LCG or xorshift+. It doesn't have to be good-quality randomness at all, just anything without a small fixed stride. Linux / glibc's rand() function internally calls random(), which uses lock cmpxchg and a memory-destination xchg, both of which wait for the store buffer to drain before later memory operations can execute. That makes them kind of a showstopper for this experiment. (Although I still did get a few counts for your attempt on my Skylake, like 1 count per 37800 cycles for resource_stalls.sb with perf stat, on a binary compiled with gcc -O3.)
#include <string.h>
#include <stdlib.h>

#define SIZE (64*1024*1024)

int main(void)
{
    volatile char *buf = malloc(sizeof(char)*SIZE);
    unsigned pos = 0;
    while(1) {
        pos = pos*317123 + 12345;  // LCG; not quite full period (2^31, see note below), but random enough
                                   // implicit modulo 2^32 with unsigned wrap-around
        buf[pos % SIZE] = 'A';
        // optional TODO: make the buffer SIZE+1024 bytes or something
        // and also store to (pos%SIZE)+64, +128, +192 to have more stores
        // to different cache lines for the same amount of ALU latency.
    }
    return 0;
}
This LCG is actually not perfectly chosen; the period before it loops back to pos = 0 is 2^31, not 2^32. Per this Q&A, choosing the multiplier such that a-1 is a multiple of 4 would make it full-period. I'm not doing the %SIZE inside the LCG dependency chain, so the modulus is 2^32; that helps latency and increases the period. It would actually be better for this use-case to have a period equal to SIZE, touching every byte once and then repeating, instead of touching some bytes multiple times before touching others at all, like we get by taking the low bits of a longer-period PRNG. Still, it works fine on my Skylake.
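For illustration, a minimal full-period sketch using the Hull-Dobell conditions (multiplier ≡ 1 mod 4, increment odd); the 214013 / 2531011 constants are just a well-known example pair I'm substituting, not anything from the code above:

#include <stdlib.h>

#define SIZE (64*1024*1024)

int main(void)
{
    volatile char *buf = malloc(SIZE);
    unsigned pos = 0;
    while (1) {
        // 214013 - 1 = 4*53503 and 2531011 is odd, so pos has the full
        // 2^32 period before it repeats.
        pos = pos*214013u + 2531011u;
        buf[pos % SIZE] = 'A';
    }
    return 0;   // unreachable
}

A side benefit of a power-of-2-modulus LCG is that its low 26 bits form their own full-period LCG mod 2^26, so pos % SIZE should hit every buffer offset exactly once per SIZE iterations, which is the access pattern described above.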
Compiled with GCC12 -O3 (as on Godbolt), the loop comes to:
.L2:
        imul    edx, edx, 317123
        add     edx, 12345            # 4-cycle loop-carried dep chain through the PRNG
        mov     ecx, edx
        and     ecx, 67108863         # 0x3FFFFFF; cheap modulo by a power of 2
        add     rcx, rax              # IDK why GCC doesn't use an indexed addr mode
        mov     BYTE PTR [rcx], 65
        jmp     .L2
So that's a decent number of operations per store, but ROB capacity usually won't limit our ability to stall on the store buffer, because stores stay in the store buffer until some time after they retire from the ROB. (i.e. only "graduated" stores are even eligible to commit to L1d cache, and that can't happen until this core completes an RFO (Read For Ownership) on the cache line. Or a store can commit into an LFB (Line Fill Buffer) awaiting that incoming cache line, but to preserve memory ordering, that's probably only for one store, or more stores to the same cache line.)
If it were a problem, unrolling to do multiple stores per iteration based off the same random number (to the same or nearby cache lines) could work, as sketched below.
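A minimal sketch of that unrolled variant, following the TODO comment in the program above (the 1024-byte padding and the 64-byte offsets are my choices; the perf numbers below are still for the original version):

#include <stdlib.h>

#define SIZE (64*1024*1024)

int main(void)
{
    // Pad the allocation so the +64/+128/+192 stores can't run past the end.
    volatile char *buf = malloc(SIZE + 1024);
    unsigned pos = 0;
    while (1) {
        pos = pos*317123 + 12345;     // same PRNG dep chain as before
        unsigned idx = pos % SIZE;
        // Four stores to four different cache lines per PRNG step:
        // more store-buffer pressure for the same ALU latency.
        buf[idx]       = 'A';
        buf[idx + 64]  = 'A';
        buf[idx + 128] = 'A';
        buf[idx + 192] = 'A';
    }
    return 0;   // unreachable
}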
$ gcc -O3 sb.c
$ taskset -c 1 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,resource_stalls.sb,idq.mite_uops -r1 ./a.out
^C./a.out: Interrupt
Performance counter stats for './a.out':
25,007.98 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
595 page-faults # 23.792 /sec
97,155,189,999 cycles # 3.885 GHz
25,489,937,218 instructions # 0.26 insn per cycle
25,609,518,732 uops_issued.any # 1.024 G/sec
25,603,447,903 uops_executed.thread # 1.024 G/sec
83,561,127,518 resource_stalls.sb # 3.341 G/sec
36,858,149 idq.mite_uops # 1.474 M/sec
25.008377007 seconds time elapsed
24.934728000 seconds user
0.006555000 seconds sys
This confirms that my i7-6700k Skylake CPU spent about 86% of its cycles stalled on the store buffer being full. From perf list, the event definition is:
resource_stalls.sb
[Cycles stalled due to no store buffers available. (not including
draining form sync)]
I'm not sure exactly what definition of "stall" applies here, but it probably counts any cycle when the front-end can't issue new uops because it can't allocate a store-buffer entry, even when there would otherwise be room in the ROB and scheduler. (There are other events that count stall cycles by other criteria, such as no uops executed that cycle, but those are tied to uop execution rather than issue/allocation.)
This test program should do similar things on any other CPU.
Compiling with optimization is probably good, especially if you have a CPU with 64 MiB of L3 cache: the less work that happens other than stores, the more time it will spend stalled instead of working on the next store address. OTOH, a debug build will store/reload local variables between statements, and those extra stores can't commit early since x86's memory model doesn't allow StoreStore reordering. The number of stores per instruction might be similar, and as long as some of them miss in L2 or L3 cache, the store buffer will fill up. (On a non-x86 ISA with a weaker memory model, compiling with optimization would be more important; otherwise the local-variable stores could commit from the store buffer out of order, without waiting for the ones that miss in cache.)
If you're on Broadwell, see Why does perf stat not count cycles:u on Broadwell CPU with hyperthreading disabled in BIOS? re: erratum BDE104 - "General-Purpose Performance Monitoring Counters 4-7 Do Not Count With USR Mode Only Filtering". So you can't safely use --all-user; leave it out and you should get plenty of counts.