I want to write a script that profiles my CUDA application using only the command-line tool nvprof. At present, I focus on two metrics: GPU utilization and GPU flops32 (FP32).
GPU utilization is the fraction of time that the GPU is active. The GPU's active time is easy to obtain with nvprof --print-gpu-trace, but the elapsed time (without overhead) of the application is not clear to me. I used the Visual Profiler, nvvp, to visualize the profiling results and calculate the GPU utilization. There, the elapsed time appears to be the interval between the first and last API call, including the overhead time.
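To make the utilization definition above concrete, here is a minimal Python sketch. It assumes the kernel start times and durations have already been extracted from the GPU trace into (start, duration) pairs in seconds, and that the elapsed time is supplied by the caller (how to obtain it is exactly the open question). The function name gpu_utilization and both parameter names are hypothetical. Overlapping kernels, for example from concurrent streams, are merged so that busy time is not counted twice.

```python
def gpu_utilization(kernel_intervals, elapsed_seconds):
    """Return the fraction of elapsed_seconds during which at least one kernel ran.

    kernel_intervals: iterable of (start, duration) pairs in seconds (assumed
    to be extracted from the GPU trace beforehand).
    elapsed_seconds: total elapsed time of the application, however defined.
    """
    busy = 0.0
    current_start = None
    current_end = None
    for start, duration in sorted(kernel_intervals):
        end = start + duration
        if current_end is None or start > current_end:
            # A gap: close the previous merged interval and open a new one.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap with the current merged interval: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return busy / elapsed_seconds


# Two overlapping kernels (0.0-1.5 s) plus one separate kernel (3.0-4.0 s)
# over 5 s of elapsed time give 2.5 s of busy time.
utilization = gpu_utilization([(0.0, 1.0), (0.5, 1.0), (3.0, 1.0)], 5.0)
```

The result depends entirely on which elapsed time is passed in, so the same trace yields a different utilization with and without profiler overhead.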
GPU flops32 is the number of FP32 operations the GPU executes per second while it is active. I followed Greg Smith's suggestion (How to calculate Gflops of a kernel) and found that nvprof is very slow at generating the flop_count_sp_* metrics.
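For reference, the calculation I have in mind once the counters are available looks roughly like the Python sketch below. It assumes the add, mul, and fma counts have already been read from the profiler output, that each FMA counts as two floating-point operations, and that special operations are excluded. The function name fp32_flops and its parameters are hypothetical.

```python
def fp32_flops(add_count, mul_count, fma_count, active_seconds):
    """Return FP32 operations per second over the GPU's active time.

    Assumption: one FMA is counted as two operations (a multiply and an add);
    special-function operations are not included.
    """
    total_operations = add_count + mul_count + 2 * fma_count
    return total_operations / active_seconds


# 100 adds, 50 multiplies and 25 FMAs in 0.5 s of active time:
# (100 + 50 + 2 * 25) / 0.5 = 400 operations per second.
flops = fp32_flops(100, 50, 25, 0.5)
```

The arithmetic itself is cheap; the slow part is collecting the counters.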
So I have two questions:
- How can I calculate the elapsed time (without overhead) of a CUDA application using nvprof?
- Is there a faster way to obtain the GPU flops32?
Any suggestions would be appreciated.
================ Update =======================
For the first question above, the elapsed time without overhead that I meant is actually the session time minus the overhead time shown in the nvvp results: