
I want to perform profiling using perf, more specifically to record CPU stall events. I think I found the right PMU event register addresses, but because my CPU is relatively new (AMD EPYC 7443) and my kernel is not the latest (v4.x.x or v5.x.x), I'm not sure if it will work.

To check that my system supports this CPU and that I found the right event registers, I thought I could run some command or create a simple app that makes the CPU stall because of a store-buffer miss/overflow. I'm not sure exactly how to do it or which command will cause these events to happen.

Brave
    Would it work to do a store that misses cache (because you `clflush`ed the relevant line), followed in program order by a long string of other stores, perhaps ones that would hit L1 cache? Given that all stores are release on x86, the later stores can't pass the earlier slow one, so they will accumulate in the store buffer until it's full. – Nate Eldredge Mar 15 '23 at 19:23
  • @NateEldredge: Yes, that will work. Or just scattered stores to a medium-sized array (larger than L2 or maybe L3 cache but small enough to avoid TLB misses), using a fast simple PRNG to generate new store addresses (array indexes). That will definitely bottleneck on the store buffer being full. In fact, so will a large memset if you want to keep it simple. (Even if it uses NT stores, those should go in the store buffer while they're speculative, before they commit to write-combining buffers.) If you just want to see counts for a store-buffer-full event, that'll do it. – Peter Cordes Mar 15 '23 at 19:43
  • This wouldn't be *simulating* stalls, it's actually causing them on the real CPU so the HW performance counter events happen. – Peter Cordes Mar 15 '23 at 19:47
  • Maybe the use of the word "simulating" is wrong, I do want to actually cause the counter to advance. I will try to use the ideas you gave me and maybe publish here the test application if I succeed. – Brave Mar 15 '23 at 19:51

3 Answers

2

You could just do a big memset; bottlenecked on DRAM bandwidth, it will spend a lot of time with the store buffer being full. (And thus unable to issue new store instructions from the front-end, so it'll stall.)
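A minimal sketch of that approach (the 512 MiB buffer size is an arbitrary choice here, just something comfortably larger than any L3 cache):

#include <string.h>
#include <stdlib.h>

#define BUF_SIZE (512UL*1024*1024)   // 512 MiB, much larger than L3

int main(void)
{
    char *buf = malloc(BUF_SIZE);
    if (!buf)
        return 1;
    while (1)
        memset(buf, 'A', BUF_SIZE);   // run under perf stat; stop with Ctrl-C
    return 0;
}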

Or maybe dd if=/dev/zero of=/dev/null bs=128M. That large block size, larger than L3 cache, will get dd to make system calls that do large copy_to/from_user inside the kernel, basically large memcpy using rep movsb.

With perf stat (without --all-user of course since the work happens in the kernel), I get lots of SB stalls on my Skylake CPU.

taskset -c 1 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,resource_stalls.sb,idq.mite_uops   -r1 dd if=/dev/zero of=/dev/null bs=128M
^C699+0 records in
699+0 records out
93818191872 bytes (94 GB, 87 GiB) copied, 3.77358 s, 24.9 GB/s
dd: Interrupt

 Performance counter stats for 'dd if=/dev/zero of=/dev/null bs=128M':

          3,774.08 msec task-clock                       #    1.000 CPUs utilized             
                14      context-switches                 #    3.710 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               139      page-faults                      #   36.830 /sec                      
    14,714,828,343      cycles                           #    3.899 GHz                       
       838,605,753      instructions                     #    0.06  insn per cycle            
     7,562,186,851      uops_issued.any                  #    2.004 G/sec                     
    10,094,269,956      uops_executed.thread             #    2.675 G/sec                     
    10,770,860,694      resource_stalls.sb               #    2.854 G/sec                     
       480,490,899      idq.mite_uops                    #  127.313 M/sec                     

       3.774573314 seconds time elapsed

       0.000000000 seconds user
       3.763515000 seconds sys

Note that resource_stalls.sb is a large fraction of cycles (about 73% here), so the majority of cycles were stalled on the store buffer being full. The instructions count is extremely small because the Linux kernel uses rep movsb for copying memory, unlike in user-space where AVX2 is available. We can see from uops_issued and uops_executed that the average number of uops per instruction is way above the usual 1 or 2 for typical non-microcoded instructions, and that the uop throughput per clock cycle is pretty reasonable. (idq.mite_uops counts legacy decode, to help detect accidental front-end stalls.)


Another thing you can do is scattered stores over a medium-sized array: small enough to get no TLB misses, but large enough to mostly miss in L3 cache (or at least miss in L2).

Use a fast PRNG, like an LCG or xorshift+. It does not have to be good quality randomness at all, just anything without a small fixed stride. Linux / glibc's rand() function internally calls random(), which uses lock cmpxchg and a memory-destination xchg, both of which wait for the store buffer to drain before later memory operations can execute. That makes them kind of a showstopper for this experiment. (Although I still did get a few counts for your attempt on my Skylake, like 1 count per 37800 cycles for resource_stalls.sb with perf stat, on a binary compiled with gcc -O3.)

#include <string.h>
#include <stdlib.h>

#define SIZE (64*1024*1024)

int main(void)
{
    volatile char *buf = malloc(sizeof(char)*SIZE);

    unsigned pos = 0;
    while(1) {
        pos = pos*317123 + 12345;   // A prime multiplier makes the period 2^n, IIRC
                 // implicit modulo 2^32 with unsigned wrap-around

        buf[pos % SIZE] = 'A';
       // optional TODO: make the buffer SIZE+1024 bytes or something
       // and also store to (pos%SIZE)+64, +128, +192 to have more stores to different cache lines for the same amount of ALU latency.
    }

    return 0;
}

This LCG is actually not perfectly chosen; the period before it loops back to pos = 0 is 2^31, not 2^32. Per this Q&A, choosing the multiplier such that a-1 is a multiple of 4 would make it full period. I'm not doing the %SIZE inside the LCG dependency chain, so the modulus is 2^32. This helps latency and increases the period, but it would actually be better for this use-case to have a period equal to SIZE, touching every byte once and then repeating, instead of touching some bytes multiple times before touching others at all, like we get by taking the low bits of a longer-period PRNG. It works fine on my Skylake.
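As a standalone sanity check of that period claim (my own sketch, not from the original answer; the alternative multiplier 317125 is just one hypothetical choice where a-1 is a multiple of 4), you can iterate the LCG until the state repeats:

#include <stdio.h>

int main(void)
{
    const unsigned a = 317123, c = 12345;   // try a = 317125 (a-1 divisible by 4) for the full 2^32 period
    unsigned pos = 1;
    const unsigned start = pos;
    unsigned long long steps = 0;
    do {
        pos = pos*a + c;    // implicit modulo 2^32 via unsigned wrap-around
        steps++;
    } while (pos != start);
    printf("period = %llu  (2^31 = %llu, 2^32 = %llu)\n", steps, 1ULL << 31, 1ULL << 32);
    return 0;
}

With a = 317123 this should print a period of 2^31; it takes a few seconds to run since every step is part of one serial multiply-add dependency chain.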

Compiled with GCC12 -O3 (as on Godbolt), the main store loop comes to:

.L2:
        imul    edx, edx, 317123
        add     edx, 12345            # 4 cycle loop-carried dep chain through the PRNG
        mov     ecx, edx
        and     ecx, 67108863        # 0x3FFFFFF; cheap modulo by a power of 2
        add     rcx, rax             # IDK why GCC doesn't use an indexed addr mode
        mov     BYTE PTR [rcx], 65
        jmp     .L2

So that's a decent number of operations per store, but ROB capacity usually won't limit our ability to stall on the store buffer, because stores stay in the store buffer until some time after they retire from the ROB. (i.e. only "graduated" stores are even eligible to commit to L1d cache, and that can't happen until this core completes an RFO (Read For Ownership) on the cache line. Or a store commits into an LFB (Line Fill Buffer) awaiting that incoming cache line, but to preserve memory ordering that's probably only for one store at a time, or more stores to the same cache line.)

If it were a problem, unrolling a bit to do multiple stores per iteration based on the same random number (to the same or nearby cache lines) could work, along the lines of the sketch below.
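For example, a hypothetical unrolled variant along the lines of the TODO comment in the code above (the perf results below are for the original single-store loop, not this version):

#include <stdlib.h>

#define SIZE (64*1024*1024)

int main(void)
{
    // over-allocate so idx+192 stays in bounds for any idx < SIZE
    volatile char *buf = malloc(sizeof(char)*SIZE + 1024);

    unsigned pos = 0;
    while (1) {
        pos = pos*317123 + 12345;    // same LCG as the original loop
        unsigned idx = pos % SIZE;
        buf[idx]       = 'A';        // 4 stores to 4 consecutive cache lines
        buf[idx + 64]  = 'A';        // per PRNG step
        buf[idx + 128] = 'A';
        buf[idx + 192] = 'A';
    }

    return 0;
}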

$ gcc -O3 sb.c
$ taskset -c 1 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,resource_stalls.sb,idq.mite_uops   -r1 ./a.out 
^C./a.out: Interrupt

 Performance counter stats for './a.out':

         25,007.98 msec task-clock                       #    1.000 CPUs utilized             
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               595      page-faults                      #   23.792 /sec                      
    97,155,189,999      cycles                           #    3.885 GHz                       
    25,489,937,218      instructions                     #    0.26  insn per cycle            
    25,609,518,732      uops_issued.any                  #    1.024 G/sec                     
    25,603,447,903      uops_executed.thread             #    1.024 G/sec                     
    83,561,127,518      resource_stalls.sb               #    3.341 G/sec                     
        36,858,149      idq.mite_uops                    #    1.474 M/sec                     

      25.008377007 seconds time elapsed

      24.934728000 seconds user
       0.006555000 seconds sys

This confirms that my i7-6700k Skylake CPU spent about 86% of its cycles stalled on the store buffer being full. From perf list, the event definition is:

resource_stalls.sb

  [Cycles stalled due to no store buffers available. (not including
  draining form sync)]

I'm not sure what definition of "stall" applies here, but it's probably counting any cycle when the front-end can't issue new instructions because it can't allocate an SB entry, when there would otherwise be room in the ROB and scheduler. (There are other events that count stall cycles based on criteria like the number of uops executed that cycle, but those relate to uop execution.)


This test program should do similar things on any other CPU.

Compiling with optimization is probably good, especially if you have a CPU with 64 MiB of L3 cache; the less work that happens other than stores, the more time it will spend stalled instead of working on the next store address. OTOH, a debug build will store/reload local variables between statements, and those extra stores can't commit early since x86's memory model doesn't allow StoreStore reordering. The number of stores per instruction might be similar, and as long as some of them miss in L2 or L3 cache, the store buffer will fill up. (On non-x86, compiling with optimization would be more important, otherwise the local-variable stores could commit from the store buffer without waiting for the ones that miss in cache.)

If you're on Broadwell, see Why does perf stat not count cycles:u on Broadwell CPU with hyperthreading disabled in BIOS? re: BDE104 - "General-Purpose Performance Monitoring Counters 4-7 Will Not Increment Do Not Count With USR Mode Only Filtering". So you can't safely use --all-user; leave it out and you should get plenty of counts.

Peter Cordes
0

Edit: Hi. Thank you for the explanation. I also have a system with a Sandy Bridge CPU. I've tried your example and I saw resource_stalls.sb events, much fewer than you saw, but still some.

Regarding the PMC events on the AMD-based machine: like I said before, in the manual for Zen 2 (my CPU is Zen 3) I found "PMCx087 [Instruction Pipe Stall]", and you can specifically count cycles where the "instruction cache pipeline was stalled during this clock cycle due to upstream not providing fetch address quickly".

I counted this event and compared it with the results I get from the Sandy Bridge based machine. I know it's not an exact science, but I can definitely see a correlation between resource_stalls.sb on Intel and r287 on AMD:

Counting resource_stalls.sb on Intel vs. r287 on AMD:

  • Running for 5 sec, both systems count ~7k events.
  • Running for 20 sec: 27k events on Intel, 25k on AMD.
  • Running for 30 sec: 43k on Intel, 37k on AMD.
  • Running for 1 min: 88k on Intel, 78k on AMD.

When examining which instructions cause the events: on the Intel machine they're spread evenly between the instructions in the loop, while on the AMD machine it was almost only on the jmp instruction of the loop.

I've posted a question to AMD asking which event I should count, and also about support for kernel v4 and v5 with the EPYC 7003 family of CPUs. I hope I will get an answer. For now I think I will use the r287 event on the AMD machine.

I ran the following app and used the command perf stat -e r<PMC_ADD> -p <app_pid> to count the events. Regarding the PMC register address, I'm not sure which value is the right one. According to the manual for Zen 2, PMCx087 counts instruction pipeline stalls. When I used it with mask 6 (r687) I got:

"Performance counter stats for process id <app_pid>:

r687:u - 5,815,942,035

4.004178288 seconds time elapsed"

According to the Zen 3 architecture manual (my system's CPU is Zen 3), PMC0x0AE counts "Dispatch Resource Stall Cycles 1". That includes: "LoadQueueRsrcStall: Load Queue resource stall. Read-write. Reset: 0. Applies to all ops with load semantics" and "StoreQueueRsrcStall: Store Queue resource stall. Read-write. Reset: 0. Applies to all ops with store semantics."

When counting these events I got:

r4ae:u - 10,969,085

r2ae:u - 20

What do you think?
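Side note on the raw event codes used above (my addition; this assumes perf's usual x86 raw-event encoding, where the low byte of r<hex> is the event select and the next byte is the unit mask, and the sub-event names in the comments are my reading of the descriptions quoted above, so treat them as an assumption):

perf stat -e r687 -p <app_pid>   # event 0x87, unit mask 0x06
perf stat -e r287 -p <app_pid>   # event 0x87, unit mask 0x02
perf stat -e r4ae -p <app_pid>   # event 0xAE, unit mask 0x04 (presumably StoreQueueRsrcStall)
perf stat -e r2ae -p <app_pid>   # event 0xAE, unit mask 0x02 (presumably LoadQueueRsrcStall)

The app itself: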

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

#define L2_CACHE_SIZE 512*1024
#define STORE_LENGTH 128

int main()
{
    char *buf = malloc(sizeof(char)*L2_CACHE_SIZE);
    int r;
    srand(time(NULL));
    while(1) {
        r = rand() % (L2_CACHE_SIZE - STORE_LENGTH);
        memset(buf+r, 'A', STORE_LENGTH);
    }

    return 0;
}
Brave
  • On my Skylake (256K L2 cache) that produces some counts for `resource_stalls.sb` with `perf stat` (compiled with `gcc -O3`), but only on average 1 per 37802 cycles. The buf is pretty small so you get mostly L2 hits, that many stores have significant locality, and `rand()` is slow. Apparently on Linux, `rand()` involves both a `lock cmpxchg` retry loop, and a memory-destination `xchg` (implicit `lock` prefix), so a full memory barrier that waits for the store buffer to drain every iteration. A fast PRNG like xorshift or a simple LCG would be much better, and do your modulo with a power of 2. – Peter Cordes Mar 15 '23 at 22:18
  • Either do one big memset, or do scattered `volatile char*` stores. Maybe a few nearby stores for each random number is ok; a few store-buffer entries for the same cache line won't hurt. Also, definitely use a bigger buffer. You don't want L2 cache hits, you want at most L3 hits, or L3 misses. – Peter Cordes Mar 15 '23 at 22:20
  • IDK why Sandybridge wouldn't be stalling on SB capacity with my test loop. 7k per 5 seconds is *way* too low; could just be background noise like interrupts, especially if you didn't use `perf stat --all-user`? On my Skylake, compiling with `gcc -O3` or `gcc -O0` both make binaries that get counts on over 80% of their cycles; hard to imagine Sandybridge could be so different. (e.g. DRAM latency probably isn't lower in core clock cycles, and the size of the ROB is still big enough compared to the SB size. And ALU latency is the same, and the front-end is wide enough.) Maybe an erratum? – Peter Cordes Mar 16 '23 at 08:38
  • "instruction cache pipeline" sounds like its counting front-end stalls on indirect branches, or direct relative branches that aren't predicted early enough as in [Slow jmp-instruction](https://stackoverflow.com/q/38811901). Nothing to do with the store buffer. – Peter Cordes Mar 16 '23 at 08:39
  • Hi. I compiled with `gcc -O3`. I ran it again for 25 sec like you did, with this cmd: `perf record --all-user -e context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,resource_stalls.sb,idq.mite_uops -p $(pidof app)`. I got 36k resource_stalls.sb events. The kernel version of the Sandy Bridge system is old, 3.10.0, maybe it's related? The CPU is "Intel(R) Xeon(R) CPU E5-2680 v4". – Brave Mar 16 '23 at 09:30
  • Regarding the event description. Maybe it's misleading. This piece of code is from the kernel src code: `/* * AMD Performance Monitor Family 17h and later: */ static const u64 amd_f17h_perfmon_event_map[PERF_COUNT_HW_MAX] = { [PERF_COUNT_HW_CPU_CYCLES] = 0x0076, [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0, [PERF_COUNT_HW_CACHE_REFERENCES] = 0xff60, [PERF_COUNT_HW_CACHE_MISSES] = 0x0964, [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c2, [PERF_COUNT_HW_BRANCH_MISSES] = 0x00c3, [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x0287, [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x0187, };` – Brave Mar 16 '23 at 09:39
  • That's not a Sandy Bridge, it's Broadwell. Note the "v4" in your Xeon model number! Broadwell has an erratum that affects perf counters: see [Why does perf stat not count cycles:u on Broadwell CPU with hyperthreading disabled in BIOS?](https://stackoverflow.com/q/75601428) re: BDE104 - "General-Purpose Performance Monitoring Counters 4-7 Will Not Increment Do Not Count With USR Mode Only Filtering". So you can't safely use `--all-user`; leave it out and you should get plenty of counts. – Peter Cordes Mar 16 '23 at 16:26
  • You are right, that is not a Sandy Bridge. My first mistake was that I used `record` and not `stat`. With `stat` I'm getting different numbers..., 21,498,970,259 SB stalls after 25 sec. Same story in the AMD case: changing to `stat` gave 25,226,316 idle backend and 36,392,260 frontend. Now we can see that, like you said, these events are probably not SB events. Now I need to understand which event on AMD Zen 3 is the SB event. – Brave Mar 16 '23 at 21:07
-1

I ran the following app and used the command perf stat -e r<PMC_ADD> -p <app_pid> to count the events. Regarding the PMC register address, I'm not sure which value is the right one. According to the manual for Zen 2, PMCx087 counts instruction pipeline stalls. When I used it with mask 6 (r687) I got:

"Performance counter stats for process id <app_pid>:

r687:u - 5,815,942,035

4.004178288 seconds time elapsed"

According to the Zen 3 architecture manual (my system's CPU is Zen 3), PMC0x0AE counts "Dispatch Resource Stall Cycles 1". That includes: "LoadQueueRsrcStall: Load Queue resource stall. Read-write. Reset: 0. Applies to all ops with load semantics" and "StoreQueueRsrcStall: Store Queue resource stall. Read-write. Reset: 0. Applies to all ops with store semantics."

When counting these events I got:

r4ae:u - 10,969,085

r2ae:u - 20

What do you think?

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

#define L2_CACHE_SIZE 512*1024
#define STORE_LENGTH 128

int main()
{
    char *buf = malloc(sizeof(char)*L2_CACHE_SIZE);
    int r;
    srand(time(NULL));
    while(1) {
        r = rand() % (L2_CACHE_SIZE - STORE_LENGTH);
        memset(buf+r, 'A', STORE_LENGTH);
    }

    return 0;
}
Brave
  • Hi, Stack Overflow doesn't follow the format of a forum. We don't have follow-ups/threads/discussions here. As such this post is not really an answer as much as it is a new question. Post it as a new question entirely, bringing along any necessary context to understand it. This way we can properly answer it. – Margaret Bloom Mar 19 '23 at 14:33
  • Why are you testing this version using `rand()` again? Is it because you want to test something that includes a `lock`ed instruction again? Also, this looks like a repost of your first answer, same code and same intro text which is still there in the later half of that answer. If there's some difference, mention it. Also, only post *answers* as answers; we've already established that this specific code will very rarely bottleneck on the store-buffer being full. – Peter Cordes Mar 19 '23 at 20:20
  • I removed something I added to this message following Margaret's comment, and I will maybe post a new question regarding SB events on AMD EPYC when I have more information. I'm not using this code anymore. – Brave Mar 20 '23 at 06:57