Is there any way, or is it even possible, to get the overall utilization of a GPU over a period of time?

Question

I am trying to measure the overall utilization of a GPU (mine is an NVIDIA Tesla K20, running on Linux) over a period of time. By "overall" I mean something like how many streaming multiprocessors are scheduled to run and how many GPU cores are scheduled to run (I assume that a core that is running runs at its full speed/frequency?). It would also be nice to get the overall utilization measured in FLOPS.

Before asking here, I searched for and investigated several existing tools/libraries, including NVML (and nvidia-smi, which is built on top of it), CUPTI (and nvprof), PAPI, TAU, and Vampir. However, it seems (though I am not sure yet) that none of them can provide the information I need. For example, NVML can report "GPU Utilization" as a percentage, but according to its documentation this utilization is the "Percent of time over the past second during which one or more kernels was executing on the GPU", which is too coarse for my purpose. nvprof can report FLOPS for individual kernels (with very high overhead), but that still doesn't tell me how well the GPU as a whole is utilized.

PAPI seems to be able to get the instruction count, but it cannot distinguish floating-point operations from other instructions. I haven't tried the other two tools (TAU and Vampir) yet, but I doubt they can meet my needs.

So I am wondering: is it even possible to get overall utilization information for a GPU? If not, what is the best alternative for estimating it? My goal is to find a better schedule for multiple jobs running on the same GPU.

I am not sure if I've described my question clearly enough, so please let me know if there is anything I can add for a better description.

Thank you very much!

VAndrei · Accepted Answer · 2014-11-07T18:47:43.833

6

The NVIDIA Nsight plugin for Visual Studio has very nice graphical features that give the statistics you want. But I have the feeling that you are on a Linux machine, so that plugin won't work for you.

I suggest using the NVIDIA Visual Profiler.

The metrics reference is fairly complete and can be found here. This is how I would gather the data you are interested in:

Active SMX units - Look at sm_efficiency. It should be close to 100%. If it's lower, then some of the SMX units are not active.
Active cores / SMX - This depends. The K20 has a quad warp scheduler with dual instruction issue, and one warp instruction drives 32 cores. Each SMX has 192 SP cores and 64 DP cores. Look at the ipc metric (instructions per cycle). If your program is DP and IPC is 2, you have 100% utilization (for the entire workload execution): 2 warp instructions were issued per cycle, so all 2 x 32 = 64 DP cores were active in every cycle. If your program is SP, the theoretical peak IPC is 6, although in practice this is very hard to reach. An IPC of 6 means that 3 of the schedulers each issued 2 warp instructions per cycle, giving work to 3 x 2 x 32 = 192 SP cores.
FLOPS - If your program uses floating-point operations, I would look at flop_count_sp (or flop_count_dp for double precision) and divide it by the elapsed time in seconds.
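To make the arithmetic in the list above concrete, here is a minimal sketch in Python. It assumes the per-SMX core counts quoted above for the K20 and treats ipc as a rough proxy for core activity, which only holds if nearly all issued instructions are floating-point arithmetic of the given precision. The function names are made up for illustration and are not part of nvprof or any NVIDIA library.

```python
# Per-SMX figures for the Tesla K20, as quoted in the answer.
WARP_SIZE = 32
SP_CORES_PER_SMX = 192
DP_CORES_PER_SMX = 64


def peak_ipc(cores_per_smx, warp_size=WARP_SIZE):
    """Warp instructions per cycle needed to keep every core of one SMX busy."""
    return cores_per_smx / warp_size


def core_utilization(ipc, cores_per_smx):
    """Measured ipc as a fraction of the peak for this core type, capped at 1.0.

    Assumption: every issued instruction targets this core type.
    """
    return min(ipc / peak_ipc(cores_per_smx), 1.0)


def achieved_flops(flop_count, elapsed_seconds):
    """Floating-point operations per second from a flop_count_* metric value."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return flop_count / elapsed_seconds
```

For example, a hypothetical SP kernel with a measured ipc of 3.0 would give `core_utilization(3.0, SP_CORES_PER_SMX)` equal to 0.5. Since collecting flop_count_sp slows the kernel down, the elapsed time is better taken from a separate run without metrics.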

Regarding frequency, I wouldn't worry, but it doesn't hurt to check with nvidia-smi. If your card has enough cooling, it will stay at peak frequency while running.
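If you want to script that frequency check, something along these lines might work. It is only a sketch: it assumes your nvidia-smi supports the `clocks.sm` and `clocks.max.sm` query fields and returns numeric values for them (some boards or driver versions report "N/A" instead, which this parser does not handle). The helper names are hypothetical.

```python
import subprocess

# Ask for current and maximum SM clock in MHz, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=clocks.sm,clocks.max.sm",
    "--format=csv,noheader,nounits",
]


def parse_clocks(csv_text):
    """Turn 'current, max' lines into (current_mhz, max_mhz, ratio) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        current, maximum = (float(field) for field in line.split(","))
        rows.append((current, maximum, current / maximum))
    return rows


def read_clocks():
    """Run nvidia-smi once and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_clocks(result.stdout)
```

Calling `read_clocks()` periodically while the job runs would show whether the ratio drops below 1.0, i.e. whether the card is being throttled.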

Check the metrics reference as it will provide you much more useful information.

I think nvprof also supports profiling multiple processes. Check here. You can also filter by process ID. So you can collect these metrics in "multi-context" or "single-context" mode. The metrics reference table has a column that states whether each metric can be collected in both cases.
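As a rough illustration of how such a collection run could be assembled, the sketch below builds an nvprof command line in Python. It relies on the nvprof options `--metrics`, `--csv`, `--log-file` and `--profile-all-processes`; the helper `nvprof_command` and the application path are hypothetical, and you should check `nvprof --help` on your CUDA version before relying on it.

```python
def nvprof_command(app_argv, metrics, log_file=None, all_processes=False):
    """Build an argv list for nvprof that collects the given metrics as CSV.

    With all_processes=True nvprof is started on its own and attaches to
    CUDA processes launched afterwards, so no application is appended.
    """
    cmd = ["nvprof", "--metrics", ",".join(metrics), "--csv"]
    if log_file is not None:
        cmd += ["--log-file", log_file]
    if all_processes:
        cmd.append("--profile-all-processes")
    else:
        cmd += list(app_argv)
    return cmd


if __name__ == "__main__":
    # "./my_app" is a placeholder for your own CUDA binary.
    print(" ".join(nvprof_command(["./my_app"],
                                  ["sm_efficiency", "ipc", "flop_count_sp"])))
```

The resulting list can be passed to `subprocess.run`. Keeping the metric list short helps, because each extra metric may force nvprof to replay kernels and adds overhead.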

Note: The metrics are computed from the hardware performance counters and driver-level analysis. If NVIDIA's tools cannot provide more than this, it's unlikely that other tools will be able to offer more. But I think that properly combining the metrics can tell you everything you want to know about your application run.

edited Nov 07 '14 at 18:47

answered Nov 06 '14 at 19:36

VAndrei


Thanks for the quick reply. Yes, I am running Linux, and I think nvprof can do the same thing as Visual Profiler. However, even those metrics are not enough for me. – rsm Nov 07 '14 at 07:57
What other metrics do you need? nvprof computes these metrics for each kernel separately, and with proper interpretation you can derive more information. See my last statement in the post. – VAndrei Nov 07 '14 at 09:21
Sorry that my last comment wasn't finished before it was posted. sm_efficiency can only be applied to single context, which cannot measure the overall efficiency when there are multiple processes or threads. Previously I also tried flop_count_sp, but it drastically reduced the performance, though I can measure the execution time in another round separately. ipc metric seems useful, but just within a single warp. Anyway, it seems this is the best we can get, as stated in your last note, right? – rsm Nov 07 '14 at 17:15
I think you can do both "system wide" and "single context" collection. Check my revised post. I also added a comment on IPC. IPC is useful for determining the efficiency of executing instructions on all cores. For checking the warp efficiency, you have a metric that's called ... warp_execution_efficiency :). I think the metrics are all you need; you just need to know exactly how to combine them to get the needed info. – VAndrei Nov 07 '14 at 18:45
Great answer! Thanks, VAndrei! Could you clarify two more things? How many warp schedulers are there in total on the Tesla K20 (is it 4)? And is the reason why only 3 can be active that 3 already reach the maximum number of cores (192)? I've also noticed that using nvprof with those metrics introduces a lot of overhead. For FLOPS that is OK, but will metrics such as IPC be inaccurate due to the additional measurement overhead? – rsm Nov 10 '14 at 21:55
In the K20 each SMX has 4 dual-issue warp schedulers. This allows dispatching 4 x 2 x 32 (256) instructions to the cores per cycle. But since there are 192 SP cores and 64 DP cores, only 3 active schedulers are needed to fill the SP cores, provided each of them issues 2 warp instructions. nvprof has some overhead, but I don't expect it to be much. Testing it with small benchmarks will allow you to measure it. Generally it's below 5%. – VAndrei Nov 11 '14 at 11:32
Really appreciate your answer and your time. I've learnt a lot from this. Thanks! – rsm Nov 13 '14 at 22:34
