
I am trying to use Linux perf to profile the L3 cache bandwidth for a Python script. I see that there are no available commands to measure that directly, but I know how to get the LLC performance counters using the command below. Can anyone let me know how to calculate the L3 cache bandwidth from the perf counters, or refer me to any tools that can measure the L3 cache bandwidth? Thanks in advance for the help.

perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches python hello.py
  • Related: [Can the Intel performance monitor counters be used to measure memory bandwidth?](https://stackoverflow.com/q/47612854) for *DRAM* (L3-miss) bandwidth, system wide or per-core. There are totally separate events for L3 cache, like on my Skylake, like `offcore_response.demand_rfo.l3_hit.any_snoop` (stores other than no-RFO NT stores) and `offcore_response.demand_data_rd.l3_hit.any_snoop` (demand loads), and another event for prefetches. IDK, maybe those could be usable. Or possibly `unc_cbo_cache_lookup.any_mesi` for any L3 cache lookup? – Peter Cordes Jun 13 '22 at 03:44
  • Hello, I see. I have a question: can we simply divide the LLC cache misses / total time and get the L3 cache bandwidth? – Kailash gogineni Jun 13 '22 at 03:53
  • LLC *misses*? That would give you something like DRAM bandwidth. L3 accesses are LLC *hits* + misses. – Peter Cordes Jun 13 '22 at 03:58
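
Spelling out the arithmetic behind that exchange (a rough sketch; it assumes one 64-byte cache line per counted event, ignores hardware-prefetch traffic, and takes the generic LLC-loads / LLC-stores events to count all L3 accesses, hits plus misses, on the CPU in question):

```
# Back-of-the-envelope bandwidth estimates from generic perf LLC counters.
# Assumptions: one 64-byte line per counted event, prefetch traffic ignored,
# and LLC-loads / LLC-stores really count all L3 accesses on this CPU.
CACHE_LINE_BYTES = 64

def l3_access_bw_gbs(llc_loads: int, llc_stores: int, seconds: float) -> float:
    """Approximate L3 access bandwidth: hits + misses both reach L3."""
    return (llc_loads + llc_stores) * CACHE_LINE_BYTES / seconds / 1e9

def dram_bw_gbs(llc_load_misses: int, llc_store_misses: int, seconds: float) -> float:
    """Approximate DRAM bandwidth: only L3 misses go out to memory."""
    return (llc_load_misses + llc_store_misses) * CACHE_LINE_BYTES / seconds / 1e9
```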

1 Answer


perf stat has some named "metrics" that it knows how to calculate from other events. According to perf list on my system, those include L3_Cache_Access_BW and L3_Cache_Fill_BW.

  • L3_Cache_Access_BW [Average per-core data access bandwidth to the L3 cache [GB / sec]]
  • L3_Cache_Fill_BW [Average per-core data fill bandwidth to the L3 cache [GB / sec]]

This is from my system with a Skylake (i7-6700k). Other CPUs (especially from other vendors and architectures) might have different support for them, or might not support these metrics at all.
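
A quick way to check whether your perf build and CPU expose these metrics at all (just a sketch that string-matches perf list output; metric names can vary between perf versions and vendors):

```
# Check whether the L3 bandwidth metrics show up in `perf list` on this machine.
import subprocess

out = subprocess.run(["perf", "list"], capture_output=True, text=True).stdout
for metric in ("L3_Cache_Access_BW", "L3_Cache_Fill_BW"):
    print(metric, "->", "listed" if metric in out else "not listed")
```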

I tried it out for a simplistic sieve of Eratosthenes (using a bool array, not a bitmap), from a recent codereview question, since I had a benchmarkable version of that (with a repeat loop) lying around. It measured 52 GB/s total bandwidth (read+write, I think). The n=4000000 problem size I used thus consumes 4 MB total, which is larger than the 256 KiB L2 size but smaller than the 8 MiB L3 size.
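
The size arithmetic, spelled out (a trivial sketch, assuming the sieve stores 1 byte per candidate):

```
# Working-set sizing for the run below: the sieve's bool array is 1 byte per candidate.
n = 4_000_000
working_set_bytes = n * 1
print(working_set_bytes / 2**20)   # ~3.8 MiB: bigger than 256 KiB L2, smaller than 8 MiB L3
```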

$ echo 4000000 | 
 taskset -c 3 perf stat --all-user  -M L3_Cache_Access_BW -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions  ./sieve 


 Performance counter stats for './sieve-buggy':

     7,711,201,973      offcore_requests.all_requests #  816.916 M/sec                  
                                                  #    52.27 L3_Cache_Access_BW     
     9,441,504,472 ns   duration_time             #    1.000 G/sec                  
          9,439.41 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
             1,020      page-faults               #  108.058 /sec                   
    38,736,147,765      cycles                    #    4.104 GHz                    
    53,699,139,784      instructions              #    1.39  insn per cycle         

       9.441504472 seconds time elapsed

       9.432262000 seconds user
       0.000000000 seconds sys

Or with just -M L3_Cache_Access_BW and no -e events, it just shows offcore_requests.all_requests # 54.52 L3_Cache_Access_BW and duration_time. So it overrides the default and doesn't count cycles, instructions, and so on.

I think it's just counting all off-core requests by this core, assuming (correctly) that each one involves a 64-byte transfer. It's counted whether it hits or misses in L3 cache. Getting mostly L3 hits will obviously enable a higher bandwidth than if the uncore bottlenecks on the DRAM controllers instead.
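
Doing that arithmetic on the run above bears this out (a quick sketch; the 64-byte-per-request assumption is mine, but the result matches the metric perf printed):

```
# Re-derive the reported L3_Cache_Access_BW from the raw counter above,
# assuming one 64-byte transfer per off-core request.
offcore_requests = 7_711_201_973            # offcore_requests.all_requests
seconds = 9.441504472                       # duration_time from the same run
print(offcore_requests * 64 / seconds / 1e9)  # ~52.27 GB/s (decimal GB)
```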

Peter Cordes
  • Hello, that's great, this actually works for me. Thank you very much. I have one small doubt: how are you getting the size of the problem, which is 1 MiB? I was just asking so that I could do the same and compare it with my L2 and L3 sizes. Thanks again. – Kailash gogineni Jun 13 '22 at 04:24
  • @Kailashgogineni: I used `top` to look at the RSS (resident set size). But actually I should have remembered that this tiny hand-written asm program only uses one array, and it allocates a number of bytes equal to the input number. (And since it's a sieve, later iterations don't touch them all. But the naive sieve doesn't skip ahead like it should, so still touches most of them every pass.) And touches no other memory once it's up and running, and startup overhead is negligible, just dynamic linker + 1 malloc call. – Peter Cordes Jun 13 '22 at 04:40
  • Hello @Peter Cordes: Thanks for the answer. Do you have any idea how we can specify multiple metrics at once? For example, I'm trying something like this: "perf stat -M L3_Cache_Access_BW,DRAM_BW_Use python main.py ABC XYZ". I got an error when I executed this. Error: "Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (arb/event=0x81,umask=0x1/). /bin/dmesg | grep -i perf may provide additional information." – Kailash gogineni Jun 16 '22 at 16:49
  • @Kailashgogineni: Multiple metrics work fine for me, e.g. `-M L3_Cache_Access_BW,L2_Cache_Fill_BW`. But my system has the same problem as yours: `-M DRAM_BW_Use` doesn't work at all, even when I try it on its own. – Peter Cordes Jun 16 '22 at 20:23