The piece of code below
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, 0); //assuming current device ID is 0
collects the properties of the device into deviceProp. After a successful call to cudaGetDeviceProperties, you can read the deviceProp members that hold the device properties you need. For example, deviceProp.maxThreadsPerMultiProcessor gives the maximum number of threads per multiprocessor, deviceProp.maxThreadsPerBlock gives the maximum number of threads per block, and so on.
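For instance, here is a minimal sketch that queries device 0 and prints a few of these members (all field names below are actual members of cudaDeviceProp):
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0); // error checking omitted; see the end of this answer
    printf("Device name:             %s\n", deviceProp.name);
    printf("Number of SMs:           %d\n", deviceProp.multiProcessorCount);
    printf("Max threads per SM:      %d\n", deviceProp.maxThreadsPerMultiProcessor);
    printf("Max threads per block:   %d\n", deviceProp.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", deviceProp.sharedMemPerBlock);
    return 0;
}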
The appropriate number of threads per block, and the overall number of blocks you launch your kernel with, depend mostly on your device properties and your program. Each block you launch occupies a portion of an SM; how large a portion depends on the resources the block requests: threads, registers, and shared memory.
Consider this example. Suppose each SM of your device supports up to 2048 resident threads, 48 KB of shared memory, and 64 K registers. If your block uses 512 threads and, at the same time, all of the shared memory and registers available to an SM, a second block with the same characteristics cannot fit on that SM. You have therefore capped the achievable occupancy at 25% (512 of 2048 possible threads), leaving 1536 potential SM threads unused. If you can restructure the block so that it uses 1024 threads while consuming the same amount of registers and shared memory, you double the occupancy to 50%.
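You don't have to do this arithmetic by hand: since CUDA 6.5 the runtime can estimate it for you via cudaOccupancyMaxActiveBlocksPerMultiprocessor. A sketch, assuming a kernel named GPU_kernel (as used later in this answer) that requests no dynamic shared memory, and deviceProp filled in as above:
int blockSize = 512;            // threads per block
int maxActiveBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocksPerSM,
                                              GPU_kernel,
                                              blockSize,
                                              0); // dynamic shared memory per block

// occupancy = resident threads / maximum resident threads per SM
float occupancy = (float)(maxActiveBlocksPerSM * blockSize)
                / deviceProp.maxThreadsPerMultiProcessor;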
Launching a huge number of blocks is usually not recommended. The GPU schedules new blocks onto available SMs; if all SMs are occupied, a block is queued until some SM frees enough resources for it. Scheduling each new block carries a (small) overhead for the GPU. A better approach is to find your optimal block size first, then calculate (or profile) how many such blocks fit on one SM, and launch just enough blocks to fill all the SMs of the GPU. If you need more work done, reuse the threads of blocks that have finished their job, as in the grid-stride loop below.
For example, converting
GPU_kernel<<<1024,512>>>();
in which
__global__ void GPU_kernel(void) {
    unsigned int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    // rest of code
}
into
GPU_kernel<<<(number_of_SMs*number_of_blocks_per_SM),512>>>();
in which
__global__ void GPU_kernel(void) {
    unsigned int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    // grid-stride loop: each thread processes multiple work items
    for (; tid < 1024 * 512; tid += blockDim.x * gridDim.x) {
        // rest of code
    }
}
usually results in better performance.
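Rather than hard-coding the two launch parameters, you can derive them from the device properties and the occupancy query shown earlier (the names here match those snippets):
int number_of_SMs = deviceProp.multiProcessorCount;
int number_of_blocks_per_SM = maxActiveBlocksPerSM; // from cudaOccupancyMaxActiveBlocksPerMultiprocessor
GPU_kernel<<<number_of_SMs * number_of_blocks_per_SM, 512>>>();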
Also be aware that the pieces of code above do not include proper CUDA error checking. Please apply your own method to handle possible errors.
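One common approach (a sketch, not the only way) is to wrap every runtime call in a checking macro:
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error-checking macro; aborts with file/line info on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, 0));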