
I recently started using Numba as part of a college assignment: the goal was to compare the performance of a specific GPU-parallelized code between Numba and CUDA C. I wrote the code in CUDA C and it worked very well; I even checked GPU occupancy with Nsight. But when I ported it to Numba, making the proper adjustments, my task manager indicates that the code uses only a fraction of the available GPU (20~30%).

I proceeded to test some standard codes, such as matrix multiplication, and they worked fine in Numba. More interestingly, doubling the data size in my code doesn't double the GPU occupancy.

Does anyone have an idea on how to solve this issue?

Additional information:

I'm using a GTX 1650 (4 GB) on a Lenovo IdeaPad Gaming 3i; running the code on Spyder 5.1.5, via Anaconda; Python version 3.8; Numba version 0.54.1. I've already reinstalled Anaconda, Python, and Numba, but the results remain the same.

Below I present the kernel that is not working. It's a simple reduce algorithm with an extra step that squares all the terms before summing:

from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
from numpy import float32
import random as rnd
import sys
import time

N = 1024;

n = 32;

stride = [];

for i in range(5):
    a = n//2**(i+1);   
    stride.append(a);

stride = np.array(stride,dtype = np.int32);    
#stride = np.array(stride,dtype=np.int32);


n_particles = 8*1024;
n = 32;

@cuda.jit
def sphere(d_pos,cost,n,stride):
    
    index = cuda.threadIdx.y;

    i = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x;
    
    p = cuda.blockDim.y * cuda.threadIdx.x;

    #declaring the shared-memory array
    #maximum available shared memory: 48 kBytes/SM
    sharray = cuda.shared.array(N,float32);

    #for safety, make sure the index is within the allowed range
    if (index < n): 

        #moving data from global memory to shared memory
        sharray[index + p] = d_pos[index + p + N* cuda.blockIdx.x];
        sharray[index + p] *= sharray[index + p];
        
        cuda.syncthreads();

        #REDUCE algorithm to compute "cost"!
        for std in range(len(stride)): 
            if (index < stride[std]): 
                sharray[index + p] += sharray[index + p + stride[std]];

        cuda.syncthreads();

        #returning the value of "cost"
        if (index % n == 0):
            cost[i] = sharray[cuda.threadIdx.x* n];
            

d_pos = cuda.to_device(np.ones((n_particles*n),dtype = np.float32));
d_vel = cuda.to_device(np.ones((n_particles*n),dtype = np.float32));

cost = cuda.to_device(np.zeros(n_particles,dtype = np.float32));

B =  int(((n*n_particles-1)/1024 +1));

t0=time.time();

for i in range(2000):     
    
    sphere[(B,1),(32,32)](d_pos,cost,n,stride);
    cuda.synchronize();
    
print(cost.copy_to_host());
print(time.time()-t0);

  • Because Numba JIT is dynamically typed, arguments to the function call determine what code is emitted and run. You need to provide a complete example with the kernel call at a bare minimum before anyone could answer your question – talonmies Nov 06 '21 at 01:30
  • Thanks for the tip @talonmies, it's now compilable – Arthur Eckert Rüdiger Nov 06 '21 at 02:14
  • I'm confused. You have 8192 particles and 1024 threads per block. That is only 8 blocks. There is no way that small an amount of work can fully utilize a GPU like the one you are using. 20-30% sounds perfectly reasonable for a GPU that probably has 16 SMs and will be able to run 2 blocks per SM. That would be 8/32 = 25% utilization. Sounds reasonable to me – talonmies Nov 06 '21 at 02:52
  • task manager is not a measure of occupancy. You are comparing apples and oranges. Run the nsys profiler on your numba case and check occupancy, the same way you did in the CUDA C++ case. – Robert Crovella Nov 06 '21 at 03:13
  • @talonmies actually the number of blocks is B = 256; yes, there are 8192 particles, but each has 32 components, totaling a data size of 262,144 float32 elements – Arthur Eckert Rüdiger Nov 06 '21 at 14:35
  • @RobertCrovella but when I run the CUDA C version my task manager accurately reads 100% occupancy – Arthur Eckert Rüdiger Nov 06 '21 at 19:32

1 Answer


First, let's sort out some terminology.

occupancy (in CUDA) refers to the extent to which a kernel uses the compute resources of a GPU. It has no concept of "time" and is something that can be statically inspected (using the CUDA occupancy calculator). It is not used to refer to "GPU utilization".

utilization (in CUDA) refers to the percentage of time, over some sampling interval, during which a CUDA kernel was running on the GPU. It tells you nothing about what resources are in use.

(memory utilization, of course, refers to how much memory a CUDA kernel is using at a particular time)

The profiler can report occupancy. To my knowledge, the various GPU measurements in the Windows task manager are based largely on utilization.

In my view it makes little sense to compare an occupancy report from the profiler to a utilization measurement. Hopefully that is now clear.

Now that you have clarified that your CUDA C++ code (which you haven't shown) achieves a high number in the Windows task manager display (we don't really know which measurement you are referring to, but let's leave that aside), while your "equivalent" numba code doesn't, we can ask why.

To prepare a test case, your code is missing an import statement:

from numba import cuda

When I profile your code as-is on a GTX 960 GPU, we see the following:

nvprof --print-gpu-trace python t78.py
... (some items clipped out)
5.74555s  155.52us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [89992]
5.74571s  1.1840us                    -               -         -         -         -       20B  16.109MB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]
5.74647s  1.0240us                    -               -         -         -         -       20B  18.626MB/s    Pageable      Device  NVIDIA GeForce          1         7  [CUDA memcpy HtoD]
5.74655s  157.28us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [90009]
5.74672s  1.1840us                    -               -         -         -         -       20B  16.109MB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]
5.74748s  1.0240us                    -               -         -         -         -       20B  18.626MB/s    Pageable      Device  NVIDIA GeForce          1         7  [CUDA memcpy HtoD]
5.74756s  155.01us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [90026]
5.74772s  1.1840us                    -               -         -         -         -       20B  16.109MB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]
5.74848s  1.0240us                    -               -         -         -         -       20B  18.626MB/s    Pageable      Device  NVIDIA GeForce          1         7  [CUDA memcpy HtoD]
5.74856s  156.35us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [90043]
5.74872s  1.1830us                    -               -         -         -         -       20B  16.123MB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]
5.74890s  5.6640us                    -               -         -         -         -  32.000KB  5.3880GB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

The above is the tail-end of the trace output, and we can make a few observations.

  1. Your code as posted is doing a HtoD and DtoH memory transfer associated with each kernel invocation.
  2. The kernel itself appears to be taking about 150us to run.
  3. The time duration from one kernel launch to the next appears to be around 1000us
  4. Although I don't show it, nvidia-smi reports about 16% utilization while the code is running.

Utilization is calculated as the percentage of time that a kernel is actually running. If kernels are launched 1000us apart, and each kernel takes 150us to run, then the utilization should be 150/1000 = 15%, which is close to what nvidia-smi reports.

If I wanted to increase the utilization, I would probably do 3 things:

  1. Remove the HtoD and DtoH copies that are happening at each kernel call. These are triggered by the fact that stride is a host array; numba automatically schedules transfers for each host array, before and after a kernel launch, so that the data is available to device code. It's hard to imagine that you did this in your CUDA C++ code, so I'm guessing this is one source of the difference in the utilization measurement.
  2. Remove the cuda.synchronize(). This is only a small factor, but it seems unnecessary to me.
  3. Increase the work done by the kernel, thus making the kernel duration longer.

We could trivially implement step 1 above by declaring a device array:

d_stride = cuda.to_device(stride)

before the kernel launch, and modify the kernel launch to use d_stride in place of stride.
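Putting steps 1 and 2 together, a minimal sketch of the modified host-side loop (my adaptation of the posted driver code, not necessarily character-for-character what I profiled; everything else is unchanged) would look like this:

d_stride = cuda.to_device(stride);   # the stride table now lives on the device

t0 = time.time();
for i in range(2000):
    # launches queue up asynchronously; no per-iteration cuda.synchronize()
    sphere[(B,1),(32,32)](d_pos,cost,n,d_stride);

cuda.synchronize();                  # one final sync before reading results back
print(cost.copy_to_host());
print(time.time()-t0);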

If I only perform step 2 above, I witness about a 1% increase in utilization reported by nvidia-smi. If I do steps 1 and 2 above, I witness this new profiler output (tail end):

nvprof --print-gpu-trace python t78.py
... (some items clipped out)
2.50316s  140.93us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25025]
2.50352s  141.19us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25030]
2.50388s  141.95us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25035]
2.50428s  142.21us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25040]
2.50464s  141.28us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25045]
2.50504s  142.82us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25050]
2.50540s  139.78us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25055]
2.50562s  6.0480us                    -               -         -         -         -  32.000KB  5.0459GB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

The kernel execution duration hasn't changed much (I wouldn't expect it to); it is still about 140us. However, the intervening HtoD and DtoH copies are gone, and the time from one kernel launch to the next is now ~360us. Computing utilization this way gives 140/360 = 38.9%, and indeed nvidia-smi reports 39% utilization when this code is running.

That's still not 100% utilization for what should be back-to-back kernel launches. I haven't created a C++ version of this code, but I'm fairly confident based on my experience that it would be possible to achieve a utilization above 90%.

The remaining issue is that, according to this test case, the closest together we can launch two kernels in numba is about 360us, whereas I'm confident that in C++ that number could be 40us or less. At any launch overhead less than the ~140us kernel duration, the kernel execution is likely to become "back-to-back", translating to approximately 100% utilization. Your data for code you haven't shown seems to confirm this.

What to do?

  1. Even in CUDA C++, a very short kernel launched back-to-back may still not achieve 100% utilization if the kernel duration is shorter than the launch overhead. The solution? Devise kernels that sufficiently saturate the GPU (this gets back to occupancy) and have enough work to do, so that the kernel duration is significantly longer than the launch overhead (see the sketch after this list).

  2. Still not satisfied? Make sure you are using up-to-date versions of numba and CUDA, and if the launch overhead is still a problem, file a numba issue; of course they cannot make launch overhead disappear completely.
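As an illustration of item 1 (this is not the asker's PSO code, and not something I have profiled), here is a hypothetical sketch of the general pattern: iterate inside the kernel rather than relaunching from the host, so each launch carries far more work than the launch overhead. The kernel name square_sum_repeated and the n_iter parameter are purely illustrative; it reuses N, n_particles, d_pos and cost as defined earlier and assumes from numba import cuda.

@cuda.jit
def square_sum_repeated(d_pos, cost, n_iter):
    # one thread per particle; each thread redoes the 32-element
    # square-and-sum n_iter times instead of being relaunched n_iter times
    i = cuda.grid(1)
    if i < cost.shape[0]:
        acc = float32(0.0)
        for it in range(n_iter):
            s = float32(0.0)
            for j in range(32):
                v = d_pos[i * 32 + j]
                s += v * v
            acc += s
        cost[i] = acc / float32(n_iter)

# one launch replaces 2000 host-side launches
square_sum_repeated[(n_particles + 255) // 256, 256](d_pos, cost, 2000)
cuda.synchronize()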

(I'm not showing it here, but if we change the profiling switch from --print-gpu-trace to --print-api-trace, we can gather some fairly convincing evidence that the underlying CUDA API that numba CUDA uses is not to blame for the majority of this 360us launch overhead.)

Robert Crovella
  • Thank you a lot, you really helped me! This kernel is part of a Particle Swarm Optimization, which is what I was trying to compare. When I made the implementation in C I split it all into small kernels to keep it organized, and it seemed to run fine, so I thought simply rewriting it in Numba would be enough; I never thought there'd be that much more overhead. I tried to move the loop inside the kernel and the utilization peaked at 100%, which proves your point. I'll be writing the whole code as a single kernel to minimize overhead. Regards, Arthur – Arthur Eckert Rüdiger Nov 07 '21 at 01:05
  • @Robert I'm not sure whether it isn't mentioned here or I missed it! How can I get an estimate of the utilization using `nvidia-smi`? – Andreas Hadjigeorgiou Nov 16 '21 at 14:25
  • suggestion: in the search box at the top of this page, put in the text `nvidia-smi utilization` and then press enter. I think you'll find useful info. Also, the `nvidia-smi` utility has command line help available. `nvidia-smi --help` – Robert Crovella Nov 17 '21 at 02:00
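For example (the exact fields available depend on your driver; see nvidia-smi --help), a query along these lines prints GPU and memory utilization once per second while the code runs:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv -l 1

nvidia-smi dmon gives a similar compact per-second view of SM and memory utilization.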