I'm not a graphics programmer, so I'll approach this from a computer-architecture perspective. I have no idea which counters in particular are useful for finding which kinds of bottlenecks in 3D graphics or GPU computing, so don't read anything into which counters I chose as examples.
When you call a graphics function, a lot of the heavy lifting is done by dedicated GPU hardware.
But to keep that GPU hardware fed, the driver software running on the main CPU has to do significant work, and sometimes that can be a bottleneck. There are "driver counters" to track various things that the software is doing / waiting for, as well as hardware counters to track what the GPU hardware is actually doing.
A graphics card is like a separate computer with its own processor + memory, but the processor is a specialized GPU whose instruction set is designed for the things GPUs are good at. It still has its own clock and decodes / executes instructions like a pipelined CPU. GPU performance events can count things like the number of single-precision floating-point operations executed on this hardware, or cache hit/miss events for the GPU accessing its own memory (it has its own cache for video RAM). The counters are tracked by hardware built into the GPU pipeline.
NVidia has a table of GPU hardware events that their hardware tracks. It includes stuff like texture_busy, which counts "clock cycles the texture unit is busy". Comparing that to the total clock cycles for the period you profiled would tell you how close you came to maxing out / bottlenecking on the hardware throughput for the texture unit. Or shaded_pixel_count: "Number of rasterized pixels sent to the shading units." Within the hardware events, they're broken down by which part of the GPU hardware they belong to: there are general "GPU" events like those, "SM" (shader) events like inst_executed_vs ("Instructions executed by vertex shaders (VS), not including replays."), Cache events like l1_l2_requests ("Number of L2 requests from the L1 unit." — highly related to the number of L1 misses, I'd assume), and Memory events, like sm_inst_executed_local_loads ("Local load instructions executed.").
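To make the utilization idea concrete, here's a tiny sketch of the texture_busy comparison (the counter values are made-up numbers, standing in for what a profiler would report):

```python
# Hypothetical readings for one profiled window; a real tool would
# report these from the GPU's hardware counters.
texture_busy_cycles = 7_200_000   # cycles the texture unit was busy
elapsed_gpu_cycles  = 9_000_000   # total GPU clock cycles in the window

# Busy cycles / total cycles = how close the texture unit is to being
# the bottleneck (1.0 would mean it never went idle).
utilization = texture_busy_cycles / elapsed_gpu_cycles
print(f"texture unit utilization: {utilization:.0%}")   # -> 80%
```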
(The above examples are "for GPUs with architectures earlier than Kepler"; it turns out the first google hit I found was a page for older GPUs. That doesn't change the fundamentals: GPU events are low-level things the hardware can track, but that software on the CPU usually couldn't. It doesn't know whether there will be cache misses when sending work to the GPU.)
That table breaks the events up into "Graphics" vs. "Compute" APIs. It documents what NVidia's developer tools can show you, not necessarily what the hardware actually counts; some of the events may be synthesized from actual HW counters by NVidia's software. e.g. inst_executed_cs_ratio is probably derived from one HW counter of Compute Shader instructions executed and another HW counter of total instructions executed.
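A sketch of that kind of derived event, with made-up raw counter values (the division is my guess at how the tool would synthesize the ratio):

```python
# Hypothetical raw hardware counter readings for one profiling window.
inst_executed_cs    = 3_000_000   # compute-shader instructions (raw HW counter)
inst_executed_total = 12_000_000  # all instructions executed (raw HW counter)

# The tool-reported "ratio" event would then just be one counter
# divided by the other.
inst_executed_cs_ratio = inst_executed_cs / inst_executed_total
print(inst_executed_cs_ratio)   # -> 0.25
```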
These hardware performance counters are (probably) implemented much like the hardware CPU performance counters, which can count clock cycles, instructions, uops, stalls for various microarchitectural resources, and so on. On x86 CPUs, the counters can be set to overflow periodically and generate an interrupt (or record a sample internally in a buffer), so you can get a better picture of what exactly the CPU did while running a loop, for example. Anyway, OProfile has a table of events supported by Haswell, if you want to compare what kind of events a CPU can report vs. a GPU. There's an l2_rqsts counter like NVidia's, but also counters for branch mispredicts and other things that GPUs don't have.
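The overflow-and-sample mechanism can be sketched as a simulation (the event trace, sampling period, and instruction pointers here are all made up; real hardware typically reloads the counter with -period rather than subtracting):

```python
def sample_on_overflow(event_counts, period):
    """Simulate a counter that fires a 'sample' every `period` events.

    event_counts: list of (instruction_pointer, events_this_step) pairs.
    Returns the instruction pointers at which overflow interrupts fired,
    i.e. where the profiler would attribute the samples.
    """
    counter, samples = 0, []
    for ip, n in event_counts:
        counter += n
        while counter >= period:      # overflow: record a sample, rearm
            counter -= period
            samples.append(ip)
    return samples

# The hot spot at ip=0x40 generates most events, so most samples land there.
trace = [(0x10, 100), (0x40, 900), (0x40, 900), (0x50, 100)]
print(sample_on_overflow(trace, 500))
```

This is why statistical sampling works: code that accounts for most of the counted events also collects most of the samples.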
Driver events include things like OGL driver sleeping ("OpenGL Last frame mSec sleeping in OGL driver"), or OGL vidmem bytes ("OGL Current amount of video memory (local video memory) allocated in bytes. Drawables and render targets are not counted."). And also simple totals like OGL Frame Primitive Count and OGL Frame Vertex Count, to see how much total work the driver is sending to the GPU.
Driver counters include things like cpu_load and cpu_00_frequency to track how close to CPU-bound you are.
All of the software/driver counters represent per-frame accounting. These counters are accumulated and updated in the driver once per frame, so even if you sample more often than once per frame, the software counters will hold the same data (from the previous frame) until the end of the current frame.
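That per-frame latching behavior can be sketched like this (a toy model, not actual driver code):

```python
class DriverCounter:
    """Toy model of a per-frame software counter: updates accumulate
    privately, and only become visible to samplers at the frame boundary."""

    def __init__(self):
        self._accum = 0       # updated continuously during the current frame
        self._published = 0   # what a profiler sees (last completed frame)

    def add(self, n):
        self._accum += n      # e.g. primitives submitted so far this frame

    def end_frame(self):
        # Publish this frame's total and start accumulating the next frame.
        self._published, self._accum = self._accum, 0

    def sample(self):
        return self._published

c = DriverCounter()
c.add(10); c.add(5)
print(c.sample())   # -> 0: frame not finished, still sees the previous frame
c.end_frame()
print(c.sample())   # -> 15: the just-completed frame's total
```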
These are high-level things that the driver keeps track of in software, not low-level events counted by dedicated hardware and read out on request.