I am running performance tests on a Linux system and wondering whether there is a way to measure a process's memory bandwidth.
Currently I use perf to capture the ll_cache_miss_rd event and multiply the count by the cache line size to estimate the total amount of memory read, but I am not sure whether this approach is correct or whether there is a better way to do the measurement.
Here is some example data I collected.
perf stat -a -e task-clock,cycles,instructions,branch-misses -e stalled-cycles-frontend,stalled-cycles-backend -e cache-references,cache-misses -e LLC-loads,LLC-load-misses -e L1-dcache-loads,L1-dcache-load-misses,l1d_cache,l1d_cache_lmiss_rd,l2d_cache,l2d_cache_lmiss_rd,l3d_cache_lmiss_rd,ll_cache_miss_rd,ll_cache_rd,l1d_cache_refill,l2d_cache_refill,l3d_cache_refill ./memtest -a -p 1 -s 1024 -n 1
Performance counter stats for 'system wide':
17,617.93 msec task-clock # 11.990 CPUs utilized
1,062,539,439 cycles # 0.060 GHz (32.49%)
1,869,176,827 instructions # 1.76 insn per cycle
# 0.42 stalled cycles per insn (32.77%)
141,232 branch-misses (33.04%)
33,822,031 stalled-cycles-frontend # 3.18% frontend cycles idle (33.31%)
785,961,509 stalled-cycles-backend # 73.97% backend cycles idle (33.58%)
1,050,737,719 cache-references # 59.640 M/sec (33.86%)
593,998 cache-misses # 0.057 % of all cache refs (34.13%)
19,331,089 LLC-loads # 1.097 M/sec (29.43%)
19,096,019 LLC-load-misses # 98.78% of all LL-cache accesses (29.43%)
1,098,105,060 L1-dcache-loads # 62.329 M/sec (29.43%)
1,050,816 L1-dcache-load-misses # 0.10% of all L1-dcache accesses (29.36%)
1,051,152,285 l1d_cache # 59.664 M/sec (29.09%)
932,407 l1d_cache_lmiss_rd # 0.053 M/sec (28.82%)
66,335,528 l2d_cache # 3.765 M/sec (28.55%)
901,030 l2d_cache_lmiss_rd # 0.051 M/sec (28.27%)
17,264,961 l3d_cache_lmiss_rd # 0.980 M/sec (28.00%)
16,242,678 ll_cache_miss_rd # 0.922 M/sec (27.79%)
16,521,909 ll_cache_rd # 0.938 M/sec (27.79%)
498,514 l1d_cache_refill # 0.028 M/sec (27.79%)
461,947 l2d_cache_refill # 0.026 M/sec (27.79%)
34,101,918 l3d_cache_refill # 1.936 M/sec (42.24%)
So ll_cache_miss_rd is reported as 0.922 M/sec, which would mean 0.922 M * 64 bytes per second. Is this the correct way to get memtest's memory bandwidth?
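As a rough sketch of the arithmetic in question, the helpers below convert a miss count into bytes and into a bytes-per-second figure. The function names are hypothetical, the 64-byte line size is the value assumed above, and the elapsed time is meant to be wall-clock seconds. One thing visible in the output above is that the rate column works out to the count divided by task-clock (16,242,678 / 17.61793 s ≈ 0.922 M), and with -a the task-clock covers about 12 CPUs, so that column is not a wall-clock rate.

```c
#include <stdint.h>

/* Hypothetical helper: total bytes implied by a cache-miss count,
 * assuming every counted miss transfers exactly one cache line. */
uint64_t miss_bytes(uint64_t miss_count, uint64_t line_size)
{
    return miss_count * line_size;
}

/* Hypothetical helper: average bandwidth in bytes per second over
 * elapsed_sec seconds of wall-clock time. Returns 0 for a non-positive
 * interval to avoid dividing by zero. */
double miss_bandwidth(uint64_t miss_count, uint64_t line_size, double elapsed_sec)
{
    if (elapsed_sec <= 0.0) {
        return 0.0;
    }
    return (double)miss_bytes(miss_count, line_size) / elapsed_sec;
}
```

For the run above, miss_bytes(16242678, 64) gives 1,039,531,392 bytes, about 991 MiB in total; dividing that by the wall-clock duration of the run would give the average read bandwidth under these assumptions.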
** Updated **
The memtest is a C program that allocates 512MB of memory and reads data from it in a loop: for (i = 0; i < 512*1024*1024; i++). That is why I capture the cache_miss_rd / cache_miss_load events.
So far, I have no idea how to measure the process's memory bandwidth for reads AND writes. :-(
My code for reading memory is as follows:
unsigned long i = 0;
unsigned char x = 0;
for(i = 0; i < size; i++) {
x = ptr[i];
}
But if I change the variables i and x to register variables, like this:
register unsigned long i = 0;
register unsigned char x = 0;
for(i = 0; i < size; i++) {
x = ptr[i];
}
the perf result is totally different: the ll_cache_miss_rd count is much lower than before, and multiplied by 64 B (the cache line size) it does NOT match the buffer size I tested (1024MB in this case). The output is as follows:
Performance counter stats for 'system wide':
16,903.85 msec task-clock # 11.992 CPUs utilized
939,400,761 cycles # 0.056 GHz (32.42%)
1,184,652,317 instructions # 1.26 insn per cycle
# 0.62 stalled cycles per insn (32.42%)
145,960 branch-misses (32.55%)
36,848,105 stalled-cycles-frontend # 3.92% frontend cycles idle (32.83%)
739,398,917 stalled-cycles-backend # 78.71% backend cycles idle (33.12%)
611,812,379 cache-references # 36.194 M/sec (33.40%)
564,612 cache-misses # 0.092 % of all cache refs (33.68%)
638,695 LLC-loads # 0.038 M/sec (28.96%)
475,592 LLC-load-misses # 74.46% of all LL-cache accesses (28.96%)
637,781,610 L1-dcache-loads # 37.730 M/sec (28.96%)
461,082 L1-dcache-load-misses # 0.07% of all L1-dcache accesses (28.96%)
637,837,862 l1d_cache # 37.733 M/sec (28.96%)
352,012 l1d_cache_lmiss_rd # 0.021 M/sec (28.97%)
30,424,639 l2d_cache # 1.800 M/sec (28.96%)
337,131 l2d_cache_lmiss_rd # 0.020 M/sec (28.96%)
912,304 l3d_cache_lmiss_rd # 0.054 M/sec (28.97%)
1,624,539 ll_cache_miss_rd # 0.096 M/sec (28.83%)
2,010,140 ll_cache_rd # 0.119 M/sec (28.55%)
1,045,832 l1d_cache_refill # 0.062 M/sec (28.27%)
924,750 l2d_cache_refill # 0.055 M/sec (27.98%)
2,806,943 l3d_cache_refill # 0.166 M/sec (42.16%)
1.409586396 seconds time elapsed
So now the question becomes: how do I trigger real memory reads from a C program? I already added -O0 when compiling this program.
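A minimal sketch of one way to make sure the compiler emits every load, whatever the optimization level: read through a volatile-qualified pointer and return a checksum so the result is used. The function name read_all_bytes is hypothetical. This only constrains the compiler; it does not control hardware effects such as prefetching, so the miss counters may still not equal the buffer size divided by 64.

```c
#include <stddef.h>

/* Hypothetical helper: reads every byte of buf exactly once.
 * The volatile qualifier forbids the compiler from removing,
 * merging, or caching the loads in a register. */
unsigned long read_all_bytes(const unsigned char *buf, size_t size)
{
    const volatile unsigned char *p = buf;
    unsigned long sum = 0;

    for (size_t i = 0; i < size; i++) {
        sum += p[i];   /* one real load per iteration */
    }
    return sum;        /* returned so the work has a visible result */
}
```

In the test program this would replace the plain loop, for example sum = read_all_bytes(ptr, size), with the returned value printed or otherwise used.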