I am trying to profile a C++ program. As a first step, I want to use the Roofline model to determine whether the program is compute-bound or memory-bound, so I need to measure the following four quantities.

  1. W: number of floating-point operations performed by the program (FLOPs)
  2. Q: number of bytes of memory traffic incurred by the program (Bytes)
  3. π: peak performance (FLOP/s)
  4. β: peak bandwidth (Byte/s)
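
For reference, here is a minimal sketch of how these four quantities combine in the Roofline model: operational intensity is I = W / Q, the attainable performance is min(π, β·I), and a kernel is memory-bound when I falls below the ridge point π / β. The function and parameter names are hypothetical, chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Operational intensity I = W / Q, in FLOPs per byte.
// w_flops: total floating-point operations; q_bytes: total bytes moved.
double operational_intensity(double w_flops, double q_bytes) {
    assert(q_bytes > 0.0);
    return w_flops / q_bytes;
}

// Roofline bound on performance: min(pi, beta * I), in FLOP/s.
double attainable_flops_per_sec(double intensity,
                                double peak_flops_per_sec,
                                double peak_bytes_per_sec) {
    return std::min(peak_flops_per_sec, peak_bytes_per_sec * intensity);
}

// A kernel is memory-bound when its intensity lies left of the
// ridge point pi / beta, and compute-bound otherwise.
bool is_memory_bound(double intensity,
                     double peak_flops_per_sec,
                     double peak_bytes_per_sec) {
    assert(peak_bytes_per_sec > 0.0);
    return intensity < peak_flops_per_sec / peak_bytes_per_sec;
}
```

With measured W and Q and machine values for π and β, calling `is_memory_bound(operational_intensity(W, Q), pi, beta)` gives the classification the question is after.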

I have tried to use Linux perf to measure W. I followed the instructions here, using libpfm4 to determine the available events (by running ./showevinfo). I found that my CPU supports the INST_RETIRED event with umask X87, so I used ./check_events INST_RETIRED:X87 to find the code, which is 0x5302c0. Then I ran perf stat -e r5302c0 ./test_exe and got:

 Performance counter stats for './test_exe':

        83,381,997      r5302c0

      20.134717382 seconds time elapsed

      74.691675000 seconds user
       0.357003000 seconds sys

Questions:

  1. Is my procedure for measuring W correct? If so, W should be 83,381,997 FLOPs, right?
  2. Why is this FLOP count not stable across repeated executions?
  3. How can I measure the other quantities: Q, π, and β?

Thanks for your time and any suggestions.

Joxixi
  • With `perf` that isn't more than a few years old, it has names for events like `perf stat -e task_clock,cycles,instructions,uops_issued.any,mem_load_retired.l1_miss` and so on, so you don't need numeric events or the `ocperf.py` wrapper. Check `perf list`. Also see [FLOPS in Python using a Haswell CPU (Intel Core Processor (Haswell, no TSX))](https://stackoverflow.com/posts/comments/107156857) for events like `fp_arith_inst_retired.128b_packed_double` which actually counts flops (FMA counts as 2), if you multiply by the SIMD element count. – Peter Cordes Aug 15 '21 at 02:44
  • For bandwidth, you might want `intel_gpu_top -l` for a rough system-wide number from the memory controllers. For theoretical max bandwidth, see benchmark results or Intel's specs. For theoretical max FLOPS, see [FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2](https://stackoverflow.com/q/15655835) – Peter Cordes Aug 15 '21 at 02:45
  • Thanks Peter! I am trying the FLOPS part. For bandwidth, I am just using CPUs but not GPUs. – Joxixi Aug 15 '21 at 09:09
  • I know you're not using the GPU. Like I said, **system-wide** number from the memory controllers, which are shared by the CPU and GPU via the ring bus (or mesh on a Skylake-X). The other columns of `intel_gpu_top` output are GPU-specific and thus irrelevant, but it is a handy way to see the system-wide read and write DRAM bandwidth in MB/s. If you look at a baseline while the system is idle, that should tell you how much the iGPU is using for scan-out, if it's active at all. (Plus interrupt handlers and whatnot.) – Peter Cordes Aug 15 '21 at 10:07
  • @PeterCordes, thanks for your comments. My GPU is an NVIDIA GPU; can I use the method in the same way? – Joxixi Aug 16 '21 at 05:37
  • Did you try it? If `intel_gpu_top` works at all, it will give you bandwidth numbers for the on-chip memory controllers on the CPU. That's what I've told you 3 times now. – Peter Cordes Aug 16 '21 at 10:57
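
As a rough illustration of the comment about `fp_arith_inst_retired.*` events, the sketch below turns per-event counts into a FLOP total by multiplying each count by its SIMD element count. It assumes an Intel CPU that exposes these events (check `perf list`) and that FMA instructions are already counted twice by the hardware, as the comment says. The struct and function names are hypothetical, and the counts would have to be copied in from `perf stat` output by hand or by a script.

```cpp
#include <cstdint>

// Hypothetical container for counts read from `perf stat`, one field
// per fp_arith_inst_retired.* event. Fields left at 0 contribute nothing.
struct FpArithCounts {
    std::uint64_t scalar_single = 0;       // 1 element per instruction
    std::uint64_t scalar_double = 0;       // 1 element
    std::uint64_t packed_128b_single = 0;  // 4 floats per 128-bit vector
    std::uint64_t packed_128b_double = 0;  // 2 doubles
    std::uint64_t packed_256b_single = 0;  // 8 floats
    std::uint64_t packed_256b_double = 0;  // 4 doubles
    std::uint64_t packed_512b_single = 0;  // 16 floats (AVX-512 CPUs only)
    std::uint64_t packed_512b_double = 0;  // 8 doubles (AVX-512 CPUs only)
};

// Total FLOPs (W) = sum over events of count * SIMD element count.
std::uint64_t total_flops(const FpArithCounts& c) {
    return c.scalar_single + c.scalar_double
         + 4 * c.packed_128b_single + 2 * c.packed_128b_double
         + 8 * c.packed_256b_single + 4 * c.packed_256b_double
         + 16 * c.packed_512b_single + 8 * c.packed_512b_double;
}
```

Dividing the result by the elapsed time reported by `perf stat` gives an achieved FLOP/s figure to compare against π.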

0 Answers