
I am running performance testing on a Linux system.
I am wondering if there is a way to measure a process's memory bandwidth.
Right now I am using perf to capture the ll_cache_miss_rd count and multiplying it by the cache line size to estimate the total memory read traffic, but I am not sure whether this approach is correct, or whether there is a better way to do the measurement.

Here is an example of the data I got.

perf stat -a -e task-clock,cycles,instructions,branch-misses -e stalled-cycles-frontend,stalled-cycles-backend -e cache-references,cache-misses -e LLC-loads,LLC-load-misses -e L1-dcache-loads,L1-dcache-load-misses,l1d_cache,l1d_cache_lmiss_rd,l2d_cache,l2d_cache_lmiss_rd,l3d_cache_lmiss_rd,ll_cache_miss_rd,ll_cache_rd,l1d_cache_refill,l2d_cache_refill,l3d_cache_refill ./memtest -a -p 1 -s 1024 -n 1

 Performance counter stats for 'system wide':

         17,617.93 msec task-clock                #   11.990 CPUs utilized
     1,062,539,439      cycles                    #    0.060 GHz                      (32.49%)
     1,869,176,827      instructions              #    1.76  insn per cycle
                                                  #    0.42  stalled cycles per insn  (32.77%)
           141,232      branch-misses                                                 (33.04%)
        33,822,031      stalled-cycles-frontend   #    3.18% frontend cycles idle     (33.31%)
       785,961,509      stalled-cycles-backend    #   73.97% backend cycles idle      (33.58%)
     1,050,737,719      cache-references          #   59.640 M/sec                    (33.86%)
           593,998      cache-misses              #    0.057 % of all cache refs      (34.13%)
        19,331,089      LLC-loads                 #    1.097 M/sec                    (29.43%)
        19,096,019      LLC-load-misses           #   98.78% of all LL-cache accesses  (29.43%)
     1,098,105,060      L1-dcache-loads           #   62.329 M/sec                    (29.43%)
         1,050,816      L1-dcache-load-misses     #    0.10% of all L1-dcache accesses  (29.36%)
     1,051,152,285      l1d_cache                 #   59.664 M/sec                    (29.09%)
           932,407      l1d_cache_lmiss_rd        #    0.053 M/sec                    (28.82%)
        66,335,528      l2d_cache                 #    3.765 M/sec                    (28.55%)
           901,030      l2d_cache_lmiss_rd        #    0.051 M/sec                    (28.27%)
        17,264,961      l3d_cache_lmiss_rd        #    0.980 M/sec                    (28.00%)
        16,242,678      ll_cache_miss_rd          #    0.922 M/sec                    (27.79%)
        16,521,909      ll_cache_rd               #    0.938 M/sec                    (27.79%)
           498,514      l1d_cache_refill          #    0.028 M/sec                    (27.79%)
           461,947      l2d_cache_refill          #    0.026 M/sec                    (27.79%)
        34,101,918      l3d_cache_refill          #    1.936 M/sec                    (42.24%)

So from the run above, ll_cache_miss_rd is 0.922 M/sec, which would mean 0.922 M * 64 bytes ≈ 59 MB/s of reads. Did I get memtest's memory read bandwidth this way?
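(Aside: the same kind of counter can also be read from inside the process itself via the perf_event_open syscall, so the count is scoped to one process instead of being system-wide as with `-a`. Below is a minimal sketch using the kernel's generic last-level-cache read-miss event, roughly the LLC-load-misses line above; the helper name is just a placeholder, how the generic event maps onto ll_cache_miss_rd on a given ARMv8 core depends on the PMU driver, and it still shares the caveats discussed in the comments below about prefetching and write traffic.)

    #include <linux/perf_event.h>
    #include <asm/unistd.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>

    /* glibc has no wrapper for this syscall, so call it directly. */
    static int open_llc_read_miss_counter(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HW_CACHE;          /* generic cache event */
        attr.config = PERF_COUNT_HW_CACHE_LL |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        /* pid = 0, cpu = -1: count for this process only, on any CPU */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int fd = open_llc_read_miss_counter();
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the memory-reading loop here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t misses = 0;
        read(fd, &misses, sizeof(misses));
        printf("LL read misses: %llu (~%llu bytes from DRAM at 64 B/line)\n",
               (unsigned long long)misses,
               (unsigned long long)(misses * 64));
        close(fd);
        return 0;
    }

With the counter opened against pid 0 there is no need for `-a`, so other processes' misses are not mixed in; dividing misses * 64 by the elapsed time gives the same kind of rough read-bandwidth estimate as the calculation above.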

**Updated**
The memtest is a C program which allocates 512 MB of memory and reads data from it in a loop, `for (i = 0; i < 512*1024*1024; i++)`. That is why I capture events like cache_miss_rd or cache-load-misses.

So far, I have no idea how to measure the process's memory bandwidth for reads AND writes..... :-(

My code for reading memory is as follows:

     unsigned long i = 0;
     unsigned char x = 0;
     for(i = 0; i < size; i++) {
         x = ptr[i];
     }

But if I change the variables i and x to register variables, like this:

     register unsigned long i = 0;
     register unsigned char x = 0;
     for(i = 0; i < size; i++) {
         x = ptr[i];
     }

the perf result is totally different: the number of ll_cache_miss_rd is much smaller than before, and multiplying it by 64 B (the cache line size) does NOT match the buffer size I tested (1024 MB in this case). As follows:

Performance counter stats for 'system wide':

     16,903.85 msec task-clock                #   11.992 CPUs utilized
   939,400,761      cycles                    #    0.056 GHz                      (32.42%)
 1,184,652,317      instructions              #    1.26  insn per cycle
                                              #    0.62  stalled cycles per insn  (32.42%)
       145,960      branch-misses                                                 (32.55%)
    36,848,105      stalled-cycles-frontend   #    3.92% frontend cycles idle     (32.83%)
   739,398,917      stalled-cycles-backend    #   78.71% backend cycles idle      (33.12%)
   611,812,379      cache-references          #   36.194 M/sec                    (33.40%)
       564,612      cache-misses              #    0.092 % of all cache refs      (33.68%)
       638,695      LLC-loads                 #    0.038 M/sec                    (28.96%)
       475,592      LLC-load-misses           #   74.46% of all LL-cache accesses  (28.96%)
   637,781,610      L1-dcache-loads           #   37.730 M/sec                    (28.96%)
       461,082      L1-dcache-load-misses     #    0.07% of all L1-dcache accesses  (28.96%)
   637,837,862      l1d_cache                 #   37.733 M/sec                    (28.96%)
       352,012      l1d_cache_lmiss_rd        #    0.021 M/sec                    (28.97%)
    30,424,639      l2d_cache                 #    1.800 M/sec                    (28.96%)
       337,131      l2d_cache_lmiss_rd        #    0.020 M/sec                    (28.96%)
       912,304      l3d_cache_lmiss_rd        #    0.054 M/sec                    (28.97%)
     1,624,539      ll_cache_miss_rd          #    0.096 M/sec                    (28.83%)
     2,010,140      ll_cache_rd               #    0.119 M/sec                    (28.55%)
     1,045,832      l1d_cache_refill          #    0.062 M/sec                    (28.27%)
       924,750      l2d_cache_refill          #    0.055 M/sec                    (27.98%)
     2,806,943      l3d_cache_refill          #    0.166 M/sec                    (42.16%)

   1.409586396 seconds time elapsed

So now the question becomes: how do I trigger real memory reads from a C program? I already added -O0 when compiling this program.
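(For reference, here is a minimal sketch of the two workarounds that ended up generating real reads, as discussed in the comments below. The function names and `dummy_func` are placeholders; `dummy_func` is assumed to be defined in a separate source file so the compiler cannot see that the sum is unused.)

    /* Variant 1: a volatile pointer forces every ptr[i] access to be a real
     * load, even with optimization enabled. */
    unsigned char read_buf_volatile(const volatile unsigned char *ptr, unsigned long size)
    {
        unsigned long i;
        unsigned char x = 0;
        for (i = 0; i < size; i++) {
            x = ptr[i];
        }
        return x;
    }

    /* Variant 2: accumulate a sum and pass it to a function defined in another
     * translation unit, so the compiler cannot drop the loads as dead code. */
    extern void dummy_func(unsigned long sum);   /* placeholder, defined elsewhere */

    void read_buf_sum(const unsigned char *ptr, unsigned long size)
    {
        unsigned long i, sum = 0;
        for (i = 0; i < size; i++) {
            sum += ptr[i];
        }
        dummy_func(sum);
    }

As noted in the comments, this only makes the loads appear in the generated asm; it does not bypass the caches or hardware prefetching, so the miss counters may still not account for the full buffer size.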

  • There's a "metric group" for that, where `perf` will calculate DRAM bandwidth for you, similar to [How to calculate the L3 cache bandwidth by using the performance counters linux?](https://stackoverflow.com/q/72597540) – Peter Cordes Oct 21 '22 at 07:31
  • Hi Peter, nice to see you again. I am using ARMv8, and perf list does NOT show the metric group you just mentioned. So with the raw data from `perf stat -e`, can I deduce the memory bandwidth of the process? – wangt13 Oct 21 '22 at 07:38
  • IDK the details of what perf events mean on any ARMv8 CPUs. But if `ll_cache_miss_rd` means what it sounds like, it's only loads, not stores, so at best it gives you read bandwidth achieved in a test that might not be read-only. But much more importantly, it might not be counting hardware prefetches that *avoided* a later cache miss. In Intel CPUs, the prefetchers are only in L1d and L2, and there are events that count every L3 miss including ones that come from those prefetchers. But there are also events that count per load instruction, not per cache line request. – Peter Cordes Oct 21 '22 at 08:41
  • I updated my question. You are right; for now I am trying to get the memory bandwidth of read accesses, since there are PMU events like Lx_DCACHE_MISS that can be used to deduce the DRAM bandwidth. I have no idea how to measure the bandwidth of read + write... – wangt13 Oct 21 '22 at 13:54
  • *So now the question changes to how to trigger real memory reads from C program?* Use `volatile int*` to force certain accesses to happen (and enable optimization so there aren't a ton of loads/stores to the stack cluttering up your results). But that just causes an `ldr` or `str` to appear in the asm; it doesn't bypass cache or prefetching. – Peter Cordes Oct 24 '22 at 13:05
  • Peter, I added `volatile` before the pointer, and it worked as expected, i.e. there are real memory reads from the C program. So it seemed that even though the C program is compiled with `-O0`, it was still being optimized by reducing the memory loads. And I found another way to make it work even without the `volatile` keyword: accumulate with `sum += ptr[i];` and call a dummy function after the loop, like `dummy_func(sum)`. With the sum variable and the function call, it seemed the compiler did NOT optimize away the memory loading/reading code. You can post your comment as an answer. Thanks. – wangt13 Oct 24 '22 at 23:51
  • `-O0` treats everything sort of like `volatile`, not keeping anything in registers across C statements. [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394) explains why in more detail. The only optimization of loads that happens at `-O0` is *within* single expressions, like `ptr[i] + ptr[i]` might only load it once. (And might only load `i` from its stack slot once). But yes, using the result in a way that can be optimized away also works. – Peter Cordes Oct 25 '22 at 00:27
  • As for why any of those techniques made a difference to your `perf` results, hard to say. Perhaps making your code run faster made HW prefetch unable to keep up, so demand loads were actually missing in cache. If you were only counting demand-load misses, then your "bandwidth" calculation would happen to work in that case. – Peter Cordes Oct 25 '22 at 00:28
  • Also related: [Preventing compiler optimizations while benchmarking](https://stackoverflow.com/q/40122141) and [How not to optimize away - mechanics of a folly function](https://stackoverflow.com/q/28287064) - using inline asm to get the compiler to not optimize something away (a minimal sketch of that idiom is included after these comments). – Peter Cordes Oct 25 '22 at 00:30
  • What I am trying to get from perf is the memory bandwidth of a process. So I wrote a C program that accesses some memory, to check whether perf can report the real memory loads the process does. In this case, I really care about events like llc_data_miss/refill, etc., which can help me deduce the number of RAM accesses. And thanks for your comments and the links for my reference. You can post your comments as the answer. :-) – wangt13 Oct 25 '22 at 00:40
  • But that *doesn't* answer the real question. If I'm understanding what's going on, you can only measure memory bandwidth in processes that defeat or exceed the hardware prefetcher. So "use optimization to make your bandwidth test more efficient so the HW prefetcher can't keep up" isn't a useful answer for how you can measure the actual bandwidth of something that isn't a bandwidth microbenchmark that just loops over an array. – Peter Cordes Oct 25 '22 at 00:43
  • Peter, I have not checked the details of the ARMv8 prefetcher, but in my testing I allocated and read memory of size 512MB, so it should be larger than the prefetcher. And I think you can focus on `volatile` as the answer..... – wangt13 Oct 25 '22 at 01:13
  • HW prefetch doesn't have a fixed size. Maybe you're thinking of the cache that HW prefetching prefetches into, or you don't understand what a prefetcher is. https://en.wikipedia.org/wiki/Cache_prefetching – Peter Cordes Oct 25 '22 at 01:38
  • Maybe I misunderstood prefetching; I am not sure what the PMU event l3d_cache_refill really indicates. I assume it is the count of cache lines loaded from memory, so maybe I can flush the data cache to eliminate the prefetching effect? Sorry for my stupid question. I may open another SO question on the prefetching topic..... – wangt13 Oct 25 '22 at 03:07
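
Below is a minimal sketch of the inline-asm idiom from the links in the comments above, assuming GCC or Clang. The empty asm statement tells the compiler the value in `sum` is used, without emitting any instructions, so the loop and its loads survive optimization; `keep_value` and `read_buf_asm` are placeholder names.

    /* Tell the compiler the value is "used" without generating any code
     * (GCC/Clang extended asm), similar to folly::doNotOptimizeAway. */
    static inline void keep_value(unsigned long v)
    {
        asm volatile("" : : "r"(v) : "memory");
    }

    void read_buf_asm(const unsigned char *ptr, unsigned long size)
    {
        unsigned long i, sum = 0;
        for (i = 0; i < size; i++) {
            sum += ptr[i];
        }
        keep_value(sum);   /* the loads feeding sum cannot be optimized away */
    }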
