I'm not a graphics programmer, so I'll approach this from a computer-architecture perspective. I have no idea which counters in particular are useful for finding which kinds of bottlenecks in 3D graphics or GPU computing, so don't read anything into which counters I chose as examples.
When you call a graphics function, a lot of the heavy lifting is done by dedicated GPU hardware.
But to keep that GPU hardware fed, the driver software running on the main CPU has to do significant work, and sometimes that can be a bottleneck. There are "driver counters" to track various things that the software is doing / waiting for, as well as hardware counters to track what the GPU hardware is actually doing.
A graphics card is like a separate computer with its own processor + memory, but the processor is a specialized GPU whose instruction set is designed for the things GPUs are good at. It still has its own clock and decodes / executes instructions like a pipelined CPU. GPU performance events can count things like the number of single-precision floating-point operations executed on this hardware, or cache hit/miss events for the GPU accessing its own memory (it has its own cache for video RAM). The counters are tracked by hardware built into the GPU pipeline.
NVidia has a table of GPU hardware events that their hardware tracks. It includes stuff like texture_busy, which counts "clock cycles the texture unit is busy". Comparing that to the total clock cycles for the period you profiled would tell you how close you came to maxing out / bottlenecking on the hardware throughput for the texture unit. Or shaded_pixel_count: "Number of rasterized pixels sent to the shading units." Within the hardware events, they're broken down by which part of the GPU hardware they belong to: there are general "GPU" events like those, "SM" (shader) events like inst_executed_vs ("Instructions executed by vertex shaders (VS), not including replays."), Cache events like l1_l2_requests ("Number of L2 requests from the L1 unit." — highly related to the number of L1 misses, I'd assume), and Memory events, like sm_inst_executed_local_loads ("Local load instructions executed.").
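To make the utilization idea concrete, here's a tiny sketch of the texture_busy comparison (the counter values are made-up numbers, standing in for what a profiler would report):

```python
# Hypothetical readings for one profiled window; a real tool would
# report these from the GPU's hardware counters.
texture_busy_cycles = 7_200_000   # cycles the texture unit was busy
elapsed_gpu_cycles  = 9_000_000   # total GPU clock cycles in the window

# Busy cycles / total cycles = how close the texture unit is to being
# the bottleneck (1.0 would mean it never went idle).
utilization = texture_busy_cycles / elapsed_gpu_cycles
print(f"texture unit utilization: {utilization:.0%}")   # -> 80%
```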
(The above examples are "for GPUs with architectures earlier than Kepler"; it turns out the first google hit I found was a page for older GPUs. That doesn't change the fundamentals: GPU events are low-level things the hardware can track, but that software on the CPU usually couldn't. It doesn't know whether there will be cache misses when sending work to the GPU.)
That table breaks the events up into "Graphics" vs. "Compute" APIs. It documents what NVidia's developer tools can show you, not necessarily what the hardware actually counts; some of the events may be synthesized from actual HW counters by NVidia's software. e.g. inst_executed_cs_ratio is probably derived from one HW counter of Compute Shader instructions executed and another HW counter of total instructions executed.
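A sketch of that kind of derived event, with made-up raw counter values (the division is my guess at how the tool would synthesize the ratio):

```python
# Hypothetical raw hardware counter readings for one profiling window.
inst_executed_cs    = 3_000_000   # compute-shader instructions (raw HW counter)
inst_executed_total = 12_000_000  # all instructions executed (raw HW counter)

# The tool-reported "ratio" event would then just be one counter
# divided by the other.
inst_executed_cs_ratio = inst_executed_cs / inst_executed_total
print(inst_executed_cs_ratio)   # -> 0.25
```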
These hardware performance counters are (probably) implemented much like the hardware CPU performance counters, which can count clock cycles, instructions, uops, stalls for various microarchitectural resources, and so on. On x86 CPUs, the counters can be set to overflow periodically and generate an interrupt (or record a sample internally in a buffer), so you can get a better picture of what exactly the CPU did while running a loop, for example. Anyway, OProfile has a table of events supported by Haswell, if you want to compare what kind of events a CPU can report vs. a GPU. There's an l2_rqsts counter like NVidia's, but also counters for branch mispredicts and other things that GPUs don't have.
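The overflow-and-sample mechanism can be sketched as a simulation (the event trace, sampling period, and instruction pointers here are all made up; real hardware typically reloads the counter with -period rather than subtracting):

```python
def sample_on_overflow(event_counts, period):
    """Simulate a counter that fires a 'sample' every `period` events.

    event_counts: list of (instruction_pointer, events_this_step) pairs.
    Returns the instruction pointers at which overflow interrupts fired,
    i.e. where the profiler would attribute the samples.
    """
    counter, samples = 0, []
    for ip, n in event_counts:
        counter += n
        while counter >= period:      # overflow: record a sample, rearm
            counter -= period
            samples.append(ip)
    return samples

# The hot spot at ip=0x40 generates most events, so most samples land there.
trace = [(0x10, 100), (0x40, 900), (0x40, 900), (0x50, 100)]
print(sample_on_overflow(trace, 500))
```

This is why statistical sampling works: code that accounts for most of the counted events also collects most of the samples.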
Driver events include things like OGL driver sleeping ("OpenGL Last frame mSec sleeping in OGL driver"), or OGL vidmem bytes ("OGL Current amount of video memory (local video memory) allocated in bytes. Drawables and render targets are not counted."). And also simple totals like OGL Frame Primitive Count and OGL Frame Vertex Count, to see how much total work the driver is sending to the GPU.
Driver counters include things like cpu_load and cpu_00_frequency to track how close to CPU-bound you are.
All of the software/driver counters represent per-frame accounting. These counters are accumulated and updated in the driver once per frame, so even if you sample more often than once per frame, the software counters will hold the same data (from the previous frame) until the end of the current frame.
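That per-frame latching behavior can be sketched like this (a toy model, not actual driver code):

```python
class DriverCounter:
    """Toy model of a per-frame software counter: updates accumulate
    privately, and only become visible to samplers at the frame boundary."""

    def __init__(self):
        self._accum = 0       # updated continuously during the current frame
        self._published = 0   # what a profiler sees (last completed frame)

    def add(self, n):
        self._accum += n      # e.g. primitives submitted so far this frame

    def end_frame(self):
        # Publish this frame's total and start accumulating the next frame.
        self._published, self._accum = self._accum, 0

    def sample(self):
        return self._published

c = DriverCounter()
c.add(10); c.add(5)
print(c.sample())   # -> 0: frame not finished, still sees the previous frame
c.end_frame()
print(c.sample())   # -> 15: the just-completed frame's total
```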
These are high-level things that the driver keeps track of in software, not low-level events counted by dedicated hardware and read out on request.