
The algorithm that I'm implementing has a number of things that need to be done in parallel. My question is, if I'm not going to use shared memory, should I prefer more blocks with fewer threads per block, or more threads per block with fewer blocks, for performance, so that the total thread count adds up to the number of parallel things I need to do?

Sagar Masuti
Neil Locketz
  • Possible duplicate of [CUDA determining threads per block, blocks per grid](http://stackoverflow.com/questions/4391162/cuda-determining-threads-per-block-blocks-per-grid), [CUDA, how to choose <<<blocks, threads>>>?](http://stackoverflow.com/questions/12660060/cuda-how-to-choose-blocks-threads) and perhaps [CUDA - what if I choose too many blocks?](http://stackoverflow.com/questions/5476152/cuda-what-if-i-choose-too-many-blocks). – Vitality Nov 14 '13 at 20:16

1 Answer


I assume the "number of things" is a small number, or you wouldn't be asking this question. Attempting to expose more parallelism might be time well spent.

CUDA GPUs group execution activity and the resultant memory accesses into warps of 32 threads. So at a minimum, you'll want to start by creating at least one warp per threadblock.

You'll then want to create at least as many threadblocks as you have SMs in your GPU. If you have 4 SMs, then your next scaling increment above 32 would be to create 4 threadblocks of 32 threads each.

If you have more than 128 "number of things" in this hypothetical example, then you will probably want to increase both warps per threadblock as well as threadblocks. You might start with threadblocks until you get to some number, perhaps around 16 or so, that would allow your code to scale up on GPUs larger than your hypothetical 4-SM GPU. But there are limits to the number of threadblocks that can be open on a single SM, so pretty quickly after 16 or so threadblocks you'll also want to increase the number of warps per threadblock beyond 1 (i.e. beyond 32 threads).

These strategies for small problems will allow you to take advantage of all the hardware on the GPU as quickly as possible as your problem scales up, while still allowing opportunities for latency hiding if your problem is large enough (e.g. more than one warp per threadblock, or more than one threadblock resident per SM).

Robert Crovella