
I want to write a script to profile my CUDA application using only the command-line tool nvprof. At present, I focus on two metrics: GPU utilization and GPU flops32 (FP32).

GPU utilization is the fraction of the time that the GPU is active. The active time of the GPU can easily be obtained with nvprof --print-gpu-trace, but the elapsed time (without overhead) of the application is not clear to me. I use the visual profiler nvvp to visualize the profiling results and calculate the GPU utilization. It seems that the elapsed time there is the interval between the first and last API call, including the overhead time.

GPU flops32 is the number of FP32 instructions the GPU executes per second while it is active. I followed Greg Smith's suggestion (How to calculate Gflops of a kernel) and found that nvprof is very slow to generate the flop_count_sp_* metrics.
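For reference, this is roughly the nvprof workflow in question; ./my_app is a placeholder for the actual application:

    # per-kernel timestamps and durations, from which the GPU active time follows
    nvprof --print-gpu-trace ./my_app

    # FP32 operation counts; slow, since nvprof replays each kernel to collect metrics
    nvprof --metrics flop_count_sp ./my_app

Following Greg Smith's answer, flops32 for a kernel is then its flop_count_sp value divided by its duration from the trace.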

So I have two questions:

  1. How can I calculate the elapsed time (without overhead) of a CUDA application using nvprof?
  2. Is there a faster way to obtain the GPU flops32?

Any suggestions would be appreciated.

================ Update =======================

For the first question above, the elapsed time without overhead that I mean is actually the session time minus the overhead time shown in the nvvp results:

(screenshot: nvvp results)

Lucien Wang

1 Answer


You can use nVIDIA's NVTX library to programmatically mark named ranges or points on your timeline. The length of such a range, properly defined, would constitute your "elapsed time", and would show up very clearly in the nvvp visualization tool. Here is a "CUDA pro tip" blog post about doing this:

CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX
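As a minimal sketch of what that looks like (the kernel and the range name below are placeholders, not taken from the blog post):

    #include <nvToolsExt.h>  // NVTX; link with -lnvToolsExt

    __global__ void busy_kernel() { }  // stands in for your real work

    int main() {
        // Everything between the push and the pop shows up in nvvp as a
        // named range; its length is the "elapsed time" as you define it.
        nvtxRangePushA("measured region");

        busy_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();  // ensure the GPU work falls inside the range

        nvtxRangePop();
        return 0;
    }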

and if you want to do this in a more C++-friendly and RAII way, you can use my CUDA runtime API wrappers, which offer a scoped range marker and other utility functions. Of course, with me being the author, take my recommendation with a grain of salt and see what works for you.

About the "Elapsed time" for the session - that's the time between when you start and stop profiling activity. That can either be when the process comes up, or when you explicitly have profiling start. In my own API wrappers, there's a RAII class for that as well: cuda::profiling::scope or of course you can use the C-style API calls explicitly. (I should really write a sample program doing this, I haven't gotten around to that yet, unfortunately).

einpoklum
  • Thank you for your kind reply, please see the update above. – Lucien Wang May 08 '18 at 09:53
  • All clear, thanks a lot. By the way, is there any approach to explicitly obtain the profiling overhead time by your API wrappers? – Lucien Wang May 09 '18 at 02:03
  • @LucienWang: If you mean the time taken up by the profiling itself, I'm not sure even nVIDIA's API allows for that. Anyway, no, not in my wrapper library. – einpoklum May 09 '18 at 09:21