
I've tested empirically with several block and thread counts, and the execution time can be greatly reduced with specific values.

I don't see what the differences between blocks and threads are. I figure that threads in a block may have specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions into N parts, which are allocated to blocks/threads.
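
To make that concrete, here is a minimal sketch of the kind of launch I mean (the kernel and variable names are just placeholders):

__global__ void myKernel(float *data, int n)           // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // global thread index
    if (i < n)
        data[i] *= 2.0f;                               // some per-element work
}

// Split the N work items over blocks of a fixed thread count.
int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<numBlocks, threadsPerBlock>>>(devData, N);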

My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Would that be possible? Thank you.

Gaël Barbin

3 Answers


I believe automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature built in already.

The reason is that the optimal configuration depends on your implementation and on the kind of algorithm you are implementing. Getting the best performance requires profiling and experimentation.

Here are some limiting factors you can consider:

- Register usage in your kernel.
- Occupancy of your current implementation.
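
For example, you can query the register and shared-memory footprint of your kernel at runtime (a minimal sketch; myKernel stands for your own __global__ function):

#include <cstdio>

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, myKernel);                // myKernel: your __global__ function
printf("registers per thread:  %d\n", attr.numRegs);
printf("static shared memory:  %zu bytes\n", attr.sharedSizeBytes);
printf("max threads per block: %d\n", attr.maxThreadsPerBlock);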

Note: having more threads does not equate to the best performance. The best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
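
In newer CUDA versions there is also an occupancy API that lets you check this directly (a minimal sketch; myKernel is a placeholder):

int blockSize = 256;                                   // candidate block size
int maxActiveBlocks = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks, myKernel, blockSize, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

// Theoretical occupancy = active warps / maximum warps per multiprocessor.
float occupancy = (maxActiveBlocks * blockSize / 32.0f)
                / (prop.maxThreadsPerMultiProcessor / 32.0f);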

Hong Zhou

Hong Zhou's answer is good as far as it goes. Here are some more details:

If you use shared memory, you might want to consider it first, because it is a very limited resource, and it's not unlikely that a kernel has very specific needs that constrain the many variables controlling parallelism. You either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (at constant occupancy).
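
For example, when dynamic shared memory is sized per thread, the two scale together (a minimal sketch; the names are placeholders):

__global__ void stageKernel(float *data, int n)
{
    extern __shared__ float buf[];                     // sized at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[threadIdx.x] = data[i];                    // stage one element per thread
    __syncthreads();
    // ... work on buf ...
}

// One float of shared memory per thread: bigger blocks claim a bigger
// share of each multiprocessor's shared memory.
int threadsPerBlock = 256;
size_t sharedBytes = threadsPerBlock * sizeof(float);
stageKernel<<<numBlocks, threadsPerBlock, sharedBytes>>>(devData, n);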

If your code can live with as little as 16KB of shared memory per multiprocessor, you might want to opt for the larger (48KB) L1 cache by calling

cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
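
If only some of your kernels can make do with the smaller shared memory, the preference can also be set per kernel (a sketch; myKernel is a placeholder):

cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);   // per-kernel variant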

Further, L1 caching of non-local global accesses can be disabled using the compiler option -Xptxas=-dlcm=cg, to avoid polluting the L1 cache when the kernel already accesses global memory carefully.

Before worrying about optimal performance based on occupancy, you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).

Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:

The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).

The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs: register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
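
You can also hint the compiler with __launch_bounds__, which caps register usage per thread for the launch configuration you intend (a minimal sketch; the numbers are only examples):

// At most 256 threads per block, and at least 2 resident blocks per
// multiprocessor: the compiler limits register usage accordingly,
// possibly at the price of spilling to local memory.
__global__ void __launch_bounds__(256, 2)
boundedKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}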

Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.

It's theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to different devices and launch parameters can be nontrivial and will require recompilation, or different kernel variants to be deployed, for every target device architecture.
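
A minimal sketch of what such an in-application search could look like, timing a few candidate block sizes with CUDA events (myKernel, devData and n are placeholders):

int candidates[] = { 64, 128, 192, 256, 512 };
int bestThreads  = candidates[0];
float bestMs     = 1e30f;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int c = 0; c < 5; ++c) {
    int threads = candidates[c];
    int blocks  = (n + threads - 1) / threads;

    cudaEventRecord(start);
    myKernel<<<blocks, threads>>>(devData, n);         // time one launch per candidate
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < bestMs) { bestMs = ms; bestThreads = threads; }
}

cudaEventDestroy(start);
cudaEventDestroy(stop);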

Dude

I got a quite good answer here. In a word, computing the optimal distribution over blocks and threads is a difficult problem.

Gaël Barbin