
I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel.

I compiled my kernel.cu file to a kernel.o file with:

    nvcc -arch=sm_20 --ptxas-options=-v -c kernel.cu

and I got the following output (passed through c++filt):

ptxas info    : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info    : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]

Looking at the output above, is it correct to say that

  • each CUDA thread is using 46 registers?
  • there is no register spilling to local memory?

I am also having some issues with understanding the output.

  • My kernel is calling a whole lot of __device__ functions. Is 72 bytes the sum-total of the memory for the stack frames of the __global__ and __device__ functions?

  • What is the difference between 0 bytes spill stores and 0 bytes spill loads?

  • Why is the information for cmem (which I am assuming is constant memory) repeated twice with different figures? Within the kernel I am not using any constant memory. Does that mean the compiler is, under the hood, going to tell the GPU to use some constant memory?

This question is "continued" in: Interpreting the verbose output of ptxas, part II

einpoklum
curiousexplorer
  • 'Used 46 registers' indicates the compiler has reserved 46 registers per thread for the compiled kernel and the other registers are spilled. You can find the number of spilled registers by subtracting this number (46) from the total number of register used in the kernel's PTX. – lashgar Sep 12 '12 at 12:50
  • @Ahmad: Your first sentence is correct, but the second is not. A kernel can use less than the maximum permissible registers per thread and have no spills to local memory. – talonmies Sep 12 '12 at 13:24
  • To elaborate on talonmies' reply, PTX is a high-level abstraction with infinite registers. That's because it can be compiled for multiple generations of GPU, and the number of registers can be different. It's only when you compile down to the machine-specific code that you can really look at register use. In any case, ptxas (compiling PTX to the machine-specific code) tells you the amount of spills. – Tom Sep 12 '12 at 13:33
  • The compiler also uses constant memory for numeric constants (if they are too large to be hardcoded in the instruction opcode). Though I am not sure why cmem is repeated twice if you say that you do not use constant memory yourself in the program. –  Sep 12 '12 at 13:34
  • curiousexplorer: I've posted a "part II" for this question. – einpoklum May 16 '19 at 20:43
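To act on Tom's point that register and spill figures are only meaningful for the machine-specific code, both the PTX and the SASS can be dumped for inspection (a sketch, assuming the CUDA toolkit's nvcc and cuobjdump are on the PATH):

```shell
# Emit the PTX (the virtual ISA, with unlimited virtual registers)
nvcc -arch=sm_20 -ptx kernel.cu -o kernel.ptx

# Disassemble the machine code (SASS) inside the compiled object file,
# where the real per-thread register allocation is visible
cuobjdump -sass kernel.o
```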

1 Answer

  • Each CUDA thread is using 46 registers? Yes, correct
  • There is no register spilling to local memory? Yes, correct
  • Is 72 bytes the sum-total of the memory for the stack frames of the __global__ and __device__ functions? Yes, correct
  • What is the difference between 0 bytes spill stores and 0 bytes spill loads?
    • Fair question; the loads could be greater than the stores, since you could spill a computed value, load it once, discard it (i.e. store something else into that register), then load it again (i.e. reuse it). Update: note also that the spill load/store count is based on static analysis, as described by @njuffa in the comments below
  • Why is the information for cmem (which I am assuming is constant memory) repeated twice with different figures? Within the kernel I am not using any constant memory. Does that mean the compiler is, under the hood, going to tell the GPU to use some constant memory?
    • Constant memory is used for a few purposes, including __constant__ variables and kernel arguments. Different "banks" are used; that starts to get a bit detailed, but as long as you use less than 64KB for your __constant__ variables and less than 4KB for kernel arguments you will be ok.
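Both points can be seen in a small hypothetical kernel (a sketch with made-up names and sizes, not the asker's searchkernel). The __constant__ array counts against the 64KB constant-memory bank, the kernel parameters land in a separate cmem bank, and a dynamically indexed per-thread array usually cannot be kept in registers, so ptxas reports stack frame bytes (and, if register pressure is high enough, spill stores/loads) for it:

```cuda
__constant__ double coeffs[8];   // lives in constant memory (one cmem bank)

// Kernel parameters (in, out, n) are passed through another cmem bank.
__global__ void demo(const double *in, double *out, int n)
{
    // Per-thread array: because scratch[] is indexed dynamically below,
    // ptxas typically places it on the local-memory stack frame, and any
    // values moved between registers and that frame show up as
    // "spill stores" / "spill loads" in the --ptxas-options=-v report.
    double scratch[32];          // 256 bytes of per-thread stack frame
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    for (int i = 0; i < 32; ++i)
        scratch[i] = in[tid] * coeffs[i & 7];

    // Dynamic index defeats promotion of scratch[] into registers
    out[tid] = scratch[tid & 31];
}
```

Compiling this with `nvcc -arch=sm_20 --ptxas-options=-v -c demo.cu` would print a report of the same shape as the one in the question; the exact register and byte counts depend on the compiler version and target architecture.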
Tom
    Note that spill loads and stores are counted statically, i.e. the number of local load and local store instructions multiplied by the width of access of each load/store. They are normalized to bytes because the compiler may be able to vectorize spill loads/stores if it has enough information about alignment and register allocation allows it. As the counts are static, this is not directly a measure of traffic for spills, since spills/fills may be inside loops. Spill loads can exceed spill stores if there is re-use of the spilled data. This would imply that spill load bytes >= spill store bytes. – njuffa Sep 12 '12 at 15:16
  • Thanks @njuffa - excellent points. The compiler cannot know trip-counts for loops (unless compile-time constant). The best way to really analyse the cost of spill/fill is to use a profiler such as Nsight (or standalone NVVP), which will give you data based on execution rather than compilation. – Tom Sep 12 '12 at 16:08
  • Agreed with regard to profiling. The compiler's spill statistics are mildly useful as first-line indicators. If there is no spilling, nothing to worry about. If the numbers are small (e.g. < 32 bytes) the L1 cache should take care of them without performance impact (remember the numbers are per thread, as thread-local memory is used for spills). If the numbers are in the thousands a negative performance impact is likely and it may be time for more detailed analysis. – njuffa Sep 12 '12 at 21:11