
When writing a kernel that operates either on one element of a 1D array, whose dimension N can range from 2 to hundreds of millions, or on elements with the same index from several 1D arrays of the same dimension N, what is the optimal way to compute the grid size for the kernel launch? In particular, is the following optimal:

CPU part:

cudaDeviceProp deviceProp;
cudaSetDevice(device);
cudaGetDeviceProperties(&deviceProp, device);

max_threads_per_block = deviceProp.maxThreadsPerBlock;
number_CUDA_cores     = _ConvertSMVer2Cores(deviceProp.major, deviceProp.minor)
                        * deviceProp.multiProcessorCount;  // helper from the CUDA samples (helper_cuda.h)

GPU part:

if (N <= number_CUDA_cores) {
  blocks            = N;   // one single-thread block per element
  threads_per_block = 1;
}
else {
  blocks            = number_CUDA_cores;
  threads_per_block = (N + number_CUDA_cores - 1) / number_CUDA_cores;  // ceiling division

  if (threads_per_block > max_threads_per_block) {
    blocks            = (N + max_threads_per_block - 1) / max_threads_per_block;
    threads_per_block = max_threads_per_block;
  }
}

kernel<<<blocks, threads_per_block>>>( ..... );
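Note that except in the N <= number_CUDA_cores branch, blocks * threads_per_block generally overshoots N, so the kernel needs an index guard. A minimal sketch (the kernel body here is just a placeholder, not my actual code):

```cuda
__global__ void kernel(const float *in, float *out, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)                 // the grid can be larger than N, so guard the tail
        out[i] = 2.0f * in[i];
}
```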

Again, this has to work well for sizes of N from 2 to hundreds of millions on a K20 or K20X, or better.

Will the answer be different if:

1) the kernel is mostly straight-line code, or

2) the kernel has many if-statements whose branches can diverge during execution? (That is why I am trying to keep the blocks as small as possible.)

  • There is no easy method to determine an optimal launch configuration. Occupancy is a balancing act between different static resource allocations (shared memory, registers, warps, and blocks) and runtime resources such as issue slots, i-cache, and d-cache. Start off with a launch configuration that gives you 50% occupancy. Shared memory often dictates the maximum number of blocks. Registers per thread often limit warps. If you have a lot of __syncthreads() calls, then consider at least 2, if not 4, blocks per SM. If the problem space is larger, use fewer warps. CUDA cores should never be used in the algorithm. – Greg Smith Mar 20 '13 at 22:29
  • If you have a lot of branching in your code, I'd say it isn't suitable for a GPU. You have to follow the SIMD model in order to get maximum/good performance. – KiaMorot Mar 21 '13 at 07:39

0 Answers