
In a library, I make calls to several CUDA kernels. Of course I want to get the best performance. How users use the library can vary a bit.

The number of Blocks / Threads influences this significantly.

Is there some rule on how to choose Blocks / Threads for best performance?

For example (just a question), is it best to choose blocks high, threads low? Or the other way around? Or is it best to use some values from GetDeviceProperties()?
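For concreteness, here is roughly the kind of launch I mean; the kernel and the numbers are just placeholders, not my actual library code:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel, purely for illustration.
__global__ void myKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // stand-in for real work
}

void launch(const float *d_in, float *d_out, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // warpSize, maxThreadsPerBlock, ...

    int threads = 256;                          // common starting point, multiple of the warp size
    if (threads > prop.maxThreadsPerBlock)
        threads = prop.maxThreadsPerBlock;

    int blocks = (n + threads - 1) / threads;   // enough blocks to cover all n elements
    myKernel<<<blocks, threads>>>(d_in, d_out, n);
}
```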

talonmies
Torsten Mohr

3 Answers


Preferably you want at least one full warp of threads in a block, otherwise you're making only poor use of the available processing power. You also normally want the number of threads in a block to be evenly divisible by the warp size.

The total number of threads to use in a block depends on your resource usage. In principle you want to aim for high occupancy. The limits are set by the available shared memory and registers: if you use a lot of shared memory and/or registers, the maximum achievable occupancy drops. It then makes sense to profile and fine-tune the number of threads per block until you find a sweet spot, where the ratio of achieved to theoretical occupancy is maximized and, of course, the total occupancy itself gets as close as possible to 100%.

As a rule of thumb you want to maximize the number of threads per block while keeping good occupancy. It makes total sense, in a profiling step, to automatically iterate through the set of possible block/thread combinations and pick the best one.
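A minimal sketch of such a profiling loop might look like the following; the kernel and the data are placeholders, and in a real library you would benchmark your own kernels on representative inputs and query the device limits instead of hard-coding them:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel, purely for illustration.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 1.5f + 1.0f;   // stand-in for real work
}

int bestBlockSize(float *d_data, int n)
{
    int best = 0;
    float bestMs = 1e30f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time overhead doesn't skew the first measurement.
    work<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    // Try every multiple of the warp size up to the hardware limit
    // (1024 here; in general query cudaDeviceProp::maxThreadsPerBlock).
    for (int threads = 32; threads <= 1024; threads += 32) {
        int blocks = (n + threads - 1) / threads;
        cudaEventRecord(start);
        work<<<blocks, threads>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; best = threads; }
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("best block size: %d (%f ms)\n", best, bestMs);
    return best;
}
```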

datenwolf
    Usually an even number of warps is also preferable. On compute capability 1.x because of register allocation granularity, on higher compute capabilities in order to evenly feed the two (four on CC 3.0) warp schedulers. And often it's better to still allow for more than one block per SM, as I've just written [here](http://stackoverflow.com/a/12652166/1662425). – tera Sep 30 '12 at 10:30
  • Getting occupancy close to 100% is not necessarily the best strategy. Please see Paulius Micikevicius's performance analysis and optimization talk here: http://bit.ly/OzutxO – Mark Ebersole Oct 02 '12 at 00:40

You can use the Occupancy Calculator spreadsheet (CUDA_Occupancy_Calculator.xls) provided by NVIDIA for choosing the best configuration: try changing the values for threads and blocks in the spreadsheet until you find the configuration with the best occupancy, which in turn gives you the best performance.
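Newer CUDA toolkits (6.5 and later, so after this answer was written) also expose the same occupancy calculation programmatically; a rough sketch, with a placeholder kernel, could look like this:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel, purely for illustration.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;   // stand-in for real work
}

void reportOccupancy(int blockSize)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Resident blocks per SM for this kernel at the given block size
    // (0 bytes of dynamic shared memory assumed).
    int activeBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocks, work, blockSize, 0);

    float occupancy = (float)(activeBlocks * blockSize)
                    / (float)prop.maxThreadsPerMultiProcessor;
    printf("block size %d -> theoretical occupancy %.0f%%\n",
           blockSize, occupancy * 100.0f);
}
```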

rps
  • The assumption that "best occupancy which in turn give you the best performance" is normally incorrect. – talonmies Sep 30 '12 at 16:10

I think it mostly comes down to experience.

The block and grid sizes depend on a lot of things: the algorithm, the work per thread, resource usage, latency.

In normal cases, I start with 256*256 and then adjust it to find a better configuration.

In Thrust, they choose sizes like 257 to avoid bank conflicts.

There are a lot of resources to help you choose, e.g. on latency and block size (http://www.lsr.nectec.or.th/images/e/e6/Cuda_Optimization2.pdf).

Anyway, just try it and keep tuning.
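The bank-conflict trick behind odd sizes like 257 is usually padding a shared-memory tile by one element so threads in a warp hit different banks; a minimal sketch (TILE and the transpose kernel are just an illustration, not Thrust's actual code):

```cpp
#define TILE 32

// Launched with dim3 block(TILE, TILE), dim3 grid(width/TILE, width/TILE),
// assuming a square matrix whose side (width) is a multiple of TILE.
__global__ void transposeTile(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}
```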

luxuia