
This seems like a simple question, but I am unable to find the answer anywhere. I have a __global__ function that I call like this:

func<<<nbBlocks,nbThreadByBlock, nbBytesOfSharedMmy>>>(args);

If I understand correctly, I can never use more than 1024 for nbThreadByBlock, but how can I find out dynamically the maximum nbThreadByBlock allowed for my function func and for my GPU?

Am I correct in thinking that, if my func function uses more local variables, the maximum number of threads per block is reduced?

About the total number of blocks that I can use: is there an upper limit? I was thinking that if I request more blocks than can run at once, they will be processed sequentially; is that true?

thanks!

lezebulon

1 Answer


The piece of code below

cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, 0); //assuming current device ID is 0

collects the properties of device 0 into deviceProp. After cudaGetDeviceProperties returns successfully, you can read the deviceProp members that hold the limits you are asking about. For example, deviceProp.maxThreadsPerBlock is the maximum number of threads per block, deviceProp.maxThreadsPerMultiProcessor is the maximum number of resident threads per multiprocessor, and so on.
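Note that the device-wide maximum is not the whole story for a particular kernel: its register and shared-memory usage can lower the limit (which also answers your question about local variables). A minimal sketch of querying both, assuming your kernel is named func (error checking omitted here for brevity; cudaOccupancyMaxPotentialBlockSize requires CUDA 6.5 or later):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void func(int *out) { /* ... */ }

int main() {
    // Device-wide limit, the same for every kernel on this GPU.
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);
    printf("device max threads/block: %d\n", deviceProp.maxThreadsPerBlock);

    // Per-kernel limit: register usage of func may make this smaller.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, func);
    printf("func max threads/block:   %d\n", attr.maxThreadsPerBlock);

    // Occupancy-based suggestion for a good block size for func.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, func, 0, 0);
    printf("suggested block size:     %d\n", blockSize);
    return 0;
}
```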

The appropriate number of threads per block and the total number of blocks you launch depend mostly on your device properties and your program. Each block you launch occupies a portion of an SM; how much depends on the resources the block requests: threads, registers, and shared memory.
Consider this example. Suppose each SM of your device can hold up to 2048 threads, 48 KB of shared memory, and 64 K registers. If your block requires 512 threads and at the same time uses all the shared memory and registers available to an SM, a second block with the same characteristics cannot fit on that SM. You have therefore capped the achievable occupancy at 25%, because 2048 - 512 = 1536 potential SM threads go unused. If you can instead redesign the block so that 1024 threads consume the same amount of registers and shared memory, you double the occupancy to 50%.

Launching a huge number of blocks is usually not recommended. The GPU schedules new blocks onto available SMs; if all SMs are occupied, it queues each block until an SM has enough free resources for it. Scheduling new blocks has a (small) overhead on the GPU. It is better, once you have found your optimal block size, to calculate (or profile) how many such blocks fit on an SM, and then launch only as many blocks as it takes to occupy all the SMs of the GPU. If you need more blocks than that, you can reuse the threads of a block that have finished their job, in what is known as a grid-stride loop.
For example, converting

GPU_kernel<<<1024,512>>>();

in which

__global__ void GPU_kernel(void){

    unsigned int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    //rest of code
}

into

GPU_kernel<<<(number_of_SMs*number_of_blocks_per_SM),512>>>();

in which

__global__ void GPU_kernel(void){

    unsigned int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    // Stride by the total number of threads in the grid.
    for (; tid < 1024*512; tid += blockDim.x * gridDim.x) {
        //rest of code
    }
}

usually results in better performance.

Also be aware that in the above pieces of code I have not included proper CUDA error checking. Please apply your own method to handle possible errors.
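One common pattern (a sketch, not the only way) is to wrap every runtime call in a macro that checks the returned cudaError_t:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file and line if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, 0));
//   GPU_kernel<<<blocks, 512>>>();
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches async execution errors
```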

Farzad