First, let's sort out some terminology.
occupancy (in CUDA) refers to the extent to which a kernel occupies the compute resources of a GPU (roughly, active warps relative to the hardware maximum). It has no concept of "time" and is something that can be statically inspected (using the CUDA occupancy calculator). It is not used to refer to "GPU utilization".
utilization (in CUDA) refers to the percentage of time, over some sampling interval, during which a CUDA kernel was running on the GPU. It tells you nothing about which resources are in use.
(Memory utilization, of course, refers to how much memory a CUDA kernel is using at a particular time.)
The profiler can report occupancy. To my knowledge, the various GPU measurements in the Windows task manager are based largely on utilization.
In my view it makes little sense to compare an occupancy report from the profiler to a utilization measurement. Hopefully that is now clear.
Now that you have clarified that your CUDA C++ code (which you haven't shown) achieves a high number in the Windows task manager display (we don't really know which measurement you are referring to, but let's leave that aside), while your "equivalent" numba code does not, we can ask why.
To prepare a test case, I first had to add the import statement that your code is missing:
from numba import cuda
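Since the rest of the code isn't reproduced in this answer, the skeleton below is a hypothetical stand-in, not your actual code: the kernel body, the array sizes, and the contents of `stride` are my assumptions. It only mirrors the launch pattern visible in the trace that follows: a `sphere` kernel launched with grid (256,1,1) and block (32,32,1), passed a small host-side `stride` array, with `cuda.synchronize()` after each launch.

```python
import numpy as np
from numba import cuda

@cuda.jit
def sphere(x, out, n, stride):
    # Placeholder body; the real kernel from the question isn't shown.
    tx = cuda.threadIdx.x + cuda.threadIdx.y * cuda.blockDim.x
    i = cuda.blockIdx.x * (cuda.blockDim.x * cuda.blockDim.y) + tx
    if i < n:
        out[i] = x[i] * x[i]

n = 256 * 32 * 32                                   # one element per launched thread
x = cuda.to_device(np.random.rand(n).astype(np.float32))
out = cuda.device_array(n, dtype=np.float32)
stride = np.arange(5, dtype=np.int32)               # small *host* array passed to the kernel

for _ in range(1000):
    sphere[(256, 1, 1), (32, 32, 1)](x, out, n, stride)
    cuda.synchronize()
```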
When I profile your code as-is on a GTX 960 GPU, I see the following:
nvprof --print-gpu-trace python t78.py
... (some items clipped out)
5.74555s 155.52us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [89992]
5.74571s 1.1840us - - - - - 20B 16.109MB/s Device Pageable NVIDIA GeForce 1 7 [CUDA memcpy DtoH]
5.74647s 1.0240us - - - - - 20B 18.626MB/s Pageable Device NVIDIA GeForce 1 7 [CUDA memcpy HtoD]
5.74655s 157.28us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [90009]
5.74672s 1.1840us - - - - - 20B 16.109MB/s Device Pageable NVIDIA GeForce 1 7 [CUDA memcpy DtoH]
5.74748s 1.0240us - - - - - 20B 18.626MB/s Pageable Device NVIDIA GeForce 1 7 [CUDA memcpy HtoD]
5.74756s 155.01us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [90026]
5.74772s 1.1840us - - - - - 20B 16.109MB/s Device Pageable NVIDIA GeForce 1 7 [CUDA memcpy DtoH]
5.74848s 1.0240us - - - - - 20B 18.626MB/s Pageable Device NVIDIA GeForce 1 7 [CUDA memcpy HtoD]
5.74856s 156.35us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [90043]
5.74872s 1.1830us - - - - - 20B 16.123MB/s Device Pageable NVIDIA GeForce 1 7 [CUDA memcpy DtoH]
5.74890s 5.6640us - - - - - 32.000KB 5.3880GB/s Device Pageable NVIDIA GeForce 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
The above is the tail-end of the trace output, and we can make a few observations.
- Your code as posted is doing an HtoD and a DtoH memory transfer associated with each kernel invocation.
- The kernel itself appears to be taking about 150us to run.
- The time from one kernel launch to the next appears to be around 1000us.
- Although I don't show it here, `nvidia-smi` reports about 16% utilization while the code is running.
Utilization is calculated as the percentage of time that a kernel is actually running. If kernels are launched 1000us apart, and each kernel takes 150us to run, then the utilization should be 150/1000 = 15%, which is close to the `nvidia-smi` report.
If I wanted to increase the utilization, I would probably do 3 things:
1. Remove the HtoD and DtoH copies that are happening at each kernel call. These are triggered by the fact that `stride` is a host array; numba automatically schedules transfers for each host array, before and after a kernel launch, so that the data is available to device code. It's hard to imagine that you did this in your CUDA C++ code, so I'm guessing this is one source of difference in the utilization measurement.
2. Remove the `cuda.synchronize()` call. This is only a small factor, but it seems unnecessary to me.
3. Increase the work done by the kernel, thus making the kernel duration longer.
We could trivially implement step 1 above by declaring a device array:
d_stride = cuda.to_device(stride)
before the kernel launch, and modifying the kernel launch to use `d_stride` in place of `stride`.
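Putting steps 1 and 2 together, and reusing the hypothetical names from the skeleton above (so this is a sketch under those assumptions, not your actual code), the launch loop would change roughly like this:

```python
# Step 1: copy the small host array to the device once, outside the loop, so
# numba no longer schedules an HtoD/DtoH pair around every kernel launch.
d_stride = cuda.to_device(stride)

# Step 2: drop the per-iteration cuda.synchronize(); launches issued to the
# same stream execute in order anyway.
for _ in range(1000):
    sphere[(256, 1, 1), (32, 32, 1)](x, out, n, d_stride)

# A single synchronize (or a copy back to the host) at the end is enough.
cuda.synchronize()
result = out.copy_to_host()
```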
If I only perform step 2 above, I witness about a 1% increase in utilization reported by `nvidia-smi`. If I do steps 1 and 2 above, I witness this new profiler output (tail end):
nvprof --print-gpu-trace python t78.py
... (some items clipped out)
2.50316s 140.93us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25025]
2.50352s 141.19us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25030]
2.50388s 141.95us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25035]
2.50428s 142.21us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25040]
2.50464s 141.28us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25045]
2.50504s 142.82us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25050]
2.50540s 139.78us (256 1 1) (32 32 1) 27 4.0000KB 0B - - - - NVIDIA GeForce 1 7 cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25055]
2.50562s 6.0480us - - - - - 32.000KB 5.0459GB/s Device Pageable NVIDIA GeForce 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
The kernel execution duration hasn't changed much (I wouldn't expect it to); it is still about 140us. However, the intervening HtoD and DtoH copies are gone, and the time from one kernel launch to the next is now ~360us. Computing utilization the same way gives 140/360 = 38.9%, and indeed `nvidia-smi` reports 39% utilization when this code is running.
That's still not 100% utilization for what should be back-to-back kernel launches. I haven't created a C++ version of this code, but I'm fairly confident based on my experience that it would be possible to achieve a utilization above 90%.
The remaining issue is that, in this test case, the closest numba seems able to launch two kernels is about 360us apart, whereas I'm confident that in CUDA C++ that number could be 40us or less. With any launch overhead shorter than the ~140us kernel duration, the kernels would effectively run back-to-back, translating to approximately 100% utilization. Your data, for code you haven't shown, seems to confirm this.
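If you want a rough estimate of this launch-to-launch interval without the profiler, you can time a burst of asynchronous launches from Python and divide by the count. This is only a sketch, again reusing the hypothetical names from above:

```python
import time

# Warm-up launch so JIT compilation isn't included in the timing.
sphere[(256, 1, 1), (32, 32, 1)](x, out, n, d_stride)
cuda.synchronize()

iters = 1000
t0 = time.perf_counter()
for _ in range(iters):
    sphere[(256, 1, 1), (32, 32, 1)](x, out, n, d_stride)
cuda.synchronize()
t1 = time.perf_counter()

# Average launch-to-launch interval in microseconds. Roughly, this is the
# larger of the per-launch overhead and the kernel duration.
print("avg interval: %.1f us" % ((t1 - t0) / iters * 1e6))
```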
What to do?
Even in CUDA C++, very short kernels launched back-to-back may still not achieve 100% utilization if the kernel duration is shorter than the launch overhead. The solution? Devise kernels that sufficiently saturate the GPU (this gets back to occupancy) and have enough work to do so that the kernel duration is significantly longer than the launch overhead.
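One common way to give a numba kernel more work per launch (step 3 above) is a grid-stride loop, so each thread processes many elements and a single launch covers a much larger problem. This is a generic sketch, not a rewrite of your kernel:

```python
from numba import cuda

@cuda.jit
def sphere_more_work(x, out, n):
    # Grid-stride loop: each thread handles many elements, so one launch can
    # cover an arbitrarily large array and run long enough that the launch
    # overhead becomes a small fraction of the total time.
    start = cuda.grid(1)
    step = cuda.gridsize(1)
    for i in range(start, n, step):
        out[i] = x[i] * x[i]
```

The launch configuration can stay the same; only `n` (and the array sizes) need to grow.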
Still not satisfied? Make sure you are using up-to-date versions of numba and CUDA, and if the launch overhead is still a problem, file a numba issue; of course, they cannot make launch overhead disappear completely.
(I'm not showing it here, but if we change the profiling switch from `--print-gpu-trace` to `--print-api-trace`, we can gather some fairly convincing evidence that the underlying CUDA API calls that numba CUDA issues are not to blame for the majority of this 360us launch overhead.)