
How are threads organized to be executed by a GPU?

asked by cibercitizen1 (edited by talonmies)
  • The CUDA Programming Guide should be a good place to start for this. I would also recommend checking out the CUDA introduction from [here](https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/). – Tom Mar 06 '10 at 19:44

2 Answers


Hardware

If a GPU device has, for example, 4 multiprocessing units, and each of them can run 768 threads, then at any given moment no more than 4*768 threads will really be running in parallel (if you planned more threads, they will be waiting their turn).

Software

Threads are organized in blocks. A block is executed by one multiprocessing unit. The threads of a block can be identified (indexed) using a 1D index (x), a 2D index (x,y) or a 3D index (x,y,z), but in any case x*y*z <= 768 for our example (other restrictions apply to x, y and z; see the guide and your device's capability).

Obviously, if you need more than those 4*768 threads, you need more than 4 blocks. Blocks may also be indexed in 1D, 2D or 3D. There is a queue of blocks waiting to enter the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are executed simultaneously).

Now a simple case: processing a 512x512 image

Suppose we want one thread to process one pixel (i,j).

We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks (so as to have 512x512 threads = 4096*64).

To make indexing the image easier, it's common to organize the threads in 2D blocks with blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.

dim3 threadsPerBlock(8, 8);  // 64 threads

and a 2D grid with gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.

dim3 numBlocks(imageWidth/threadsPerBlock.x,   /* for instance 512/8 = 64 */
               imageHeight/threadsPerBlock.y);

The kernel is launched like this:

myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );       

Finally: there will be something like a "queue of 4096 blocks", where each block is waiting to be assigned one of the multiprocessors of the GPU to get its 64 threads executed.

In the kernel, the pixel (i,j) to be processed by a thread is calculated this way:

uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
uint j = (blockIdx.y * blockDim.y) + threadIdx.y;
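
Putting the pieces together, a complete minimal sketch could look like this (the device buffer dImage and the pixel-inversion operation are illustrative assumptions; the launch configuration and the indexing are exactly as above):

#include <cuda_runtime.h>

// Hypothetical kernel: invert an 8-bit grayscale image, one thread per pixel (i,j).
__global__ void myKernel(unsigned char *image, int width, int height)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;
    if (i < width && j < height)                      // guard; harmless when sizes divide evenly
        image[j * width + i] = 255 - image[j * width + i];
}

int main()
{
    const int imageWidth = 512, imageHeight = 512;

    unsigned char *dImage;                            // device copy of the image
    cudaMalloc(&dImage, imageWidth * imageHeight);

    dim3 threadsPerBlock(8, 8);                       // 64 threads per block
    dim3 numBlocks(imageWidth / threadsPerBlock.x,    // 512/8 = 64
                   imageHeight / threadsPerBlock.y);  // 512/8 = 64

    myKernel<<<numBlocks, threadsPerBlock>>>(dImage, imageWidth, imageHeight);
    cudaDeviceSynchronize();                          // wait for all 4096 blocks to drain

    cudaFree(dImage);
    return 0;
}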
answered by cibercitizen1 (edited by meJustAndrew)
  • If each block can run 768 threads, why use only 64? If you use the maximum of 768, you will have fewer blocks and so better performance. – Aliza Nov 14 '11 at 10:20
  • @Aliza: blocks are *logical*; the limit of 768 threads is for each *physical* processing unit. You use blocks according to the specifications of your problem in order to distribute the work to the threads. It is not likely that you can always use blocks of 768 threads for every problem you have. Imagine you have to process a 64x64 image (4096 pixels): 4096/768 = 5.33 blocks? – cibercitizen1 Nov 15 '11 at 10:26
  • Blocks are logical, but each block is assigned to a core. If there are more blocks than cores, the blocks are queued until cores become free. In your example you can use 6 blocks and have the extra threads do nothing (2/3 of the threads in the 6th block). – Aliza Nov 15 '11 at 12:59
  • @Aliza: It is not trivial to tune the settings of CUDA to get the most out of it. Also, remember that complex algorithms may be written more clearly in terms of blocks with dimensions such that a x b x c < 768. If it were always that simple to use 768 threads per block, the CUDA designers would not have made any other organization of the threads possible. – cibercitizen1 Nov 15 '11 at 22:49
  • @Aliza (cont.): Do more blocks make the program slower? I would check that before saying it loudly for a given problem. There are many other things that impact the performance of a kernel: which memory you use (global, textures, caches) and how it is accessed (coalescing). See the CUDA best practices guide. – cibercitizen1 Nov 15 '11 at 22:49
  • @Aliza (cont.): Finally, there is a limit on the memory available to the threads of each block. If each single thread uses too many "local" variables, then the number of threads in the block (blockDim) *must* be reduced. – cibercitizen1 Mar 22 '12 at 17:43
  • @cibercitizen1 - I think Aliza's point is a good one: if possible, one wants to use as many threads per block as possible. If there is a constraint that requires fewer threads, better to explain why that might be the case in a second example (but still explain the simpler and more desirable case first). –  Nov 19 '12 at 21:08
  • @thouis Yes, maybe. But the amount of memory needed by each thread is application dependent. For instance, in my last program each thread invokes a least-squares optimizing function requiring "a lot" of memory, so much that blocks can't be bigger than 4x4 threads. Even so, the speedup obtained was dramatic vs. the sequential version. – cibercitizen1 Nov 22 '12 at 11:04
  • Can you explain what to do when your image size isn't evenly divisible by 64? For a 511x511 image, 511*511/64 ≈ 4080.02. – MySchizoBuddy Oct 12 '13 at 20:41
  • @MySchizoBuddy You should pad the image (adding pixels), or trim it (removing pixels), so that it fits a power-of-2 dimension. – cibercitizen1 Oct 13 '13 at 17:23
  • When the image size is not evenly divisible by 64, it is not a problem to use whatever block size is suitable in general and leave the straggler edge blocks on the boundaries with fewer threads assigned (see the sketch after these comments). This will utilize the GPU better: if these edge-case blocks turn out tiny from time to time, more than one block can be assigned to an SM, leading to better hardware utilization. Do not needlessly add a potentially huge amount of extra workload just to make your image meet some multiple of a dimension or power of 2! – Steven Lu May 13 '20 at 18:28
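
A minimal sketch of that guarded approach, assuming the same 8x8 blocks as above: round the grid size up with a ceiling division, and let the threads past the image edge return early.

// Ceiling division: enough blocks to cover every pixel of, e.g., a 511x511 image.
dim3 threadsPerBlock(8, 8);
dim3 numBlocks((imageWidth + threadsPerBlock.x - 1) / threadsPerBlock.x,    // (511+7)/8 = 64
               (imageHeight + threadsPerBlock.y - 1) / threadsPerBlock.y);  // (511+7)/8 = 64

// Inside the kernel, the straggler threads of the edge blocks simply do nothing:
// if (i >= imageWidth || j >= imageHeight) return;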

Consider a 9800GT GPU:

  • it has 14 multiprocessors (SMs)
  • each SM has 8 thread processors (AKA stream processors, SPs or cores)
  • it allows up to 512 threads per block
  • its warp size is 32 (which means each of the 14x8 = 112 thread processors can schedule up to 32 threads)

https://www.tutorialspoint.com/cuda/cuda_threads.htm
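
These numbers are device specific; a minimal sketch that queries them at runtime through the CUDA runtime API (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("warp size:             %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}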

A block cannot have more than 512 active threads, therefore __syncthreads can only synchronize a limited number of threads. That is, if you execute the following with 600 threads:

func1();
__syncthreads();
func2();
__syncthreads();

then the kernel must run twice and the order of execution will be:

  1. func1 is executed for the first 512 threads
  2. func2 is executed for the first 512 threads
  3. func1 is executed for the remaining threads
  4. func2 is executed for the remaining threads

Note:

The main point is that __syncthreads is a block-wide operation and it does not synchronize all threads.
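
To illustrate that block-wide scope: a common pattern is a per-block sum reduction, where each __syncthreads orders only the threads of one block, while threads in other blocks run independently. A sketch, assuming blockDim.x is a power of two no larger than 512:

__global__ void blockSum(const float *in, float *blockResults)
{
    __shared__ float partial[512];                 // one slot per thread of this block

    unsigned int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                               // every thread OF THIS BLOCK has written

    // Tree reduction: halve the number of active threads at each step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                           // again, block-wide only
    }

    if (tid == 0)
        blockResults[blockIdx.x] = partial[0];     // one partial sum per block
}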


I'm not sure about the exact number of threads that __syncthreads can synchronize, since you can create a block with more than 512 threads and let the warp scheduler handle the scheduling. To my understanding, it would be more accurate to say: func1 is executed at least for the first 512 threads.

Before I edited this answer (back in 2010), I measured that 14x8x32 threads were synchronized by __syncthreads.

I would greatly appreciate it if someone could test this again for a more accurate piece of information.

answered by Bizhan

  • What happens if func2() depends on the results of func1()? I think this is wrong. – Chris Jan 04 '17 at 23:01
  • @Chris I wrote this seven years ago, but if I recall correctly I did a test on this and concluded that kernels with more threads than the GPU can run behave this way. If you happen to test this case and reach a different result, then I'll have to delete this post. – Bizhan Jan 04 '17 at 23:41
  • Sorry, I think this is wrong; also, that GPU can only concurrently run 112 threads. – Steven Lu May 11 '20 at 01:09
  • @StevenLu have you tried it? Also, I don't think 112 concurrent threads makes any sense for a GPU; 112 is the number of stream processors. I can hardly remember CUDA now :) – Bizhan May 12 '20 at 15:15
  • 14 x 8 = 112 is the number of stream processors, not the number of threads. – Bizhan May 12 '20 at 15:25
  • There are 14 SMs in that GPU, each of which contains 8 CUDA cores. Yes, you can dispatch thousands or millions of threads to any GPU, and the blocks they're organized in will get assigned to SMs, but at any one point in time, in each SM only 8 of the threads can be executed at a time. – Steven Lu May 12 '20 at 16:25
  • @StevenLu please read page 40 middle paragraph: http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_BestPracticesGuide_3.1.pdf – Bizhan May 12 '20 at 17:09
  • Yup. There are a lot of tricky, nuanced distinctions between the uses of these words, and I'm sure I wasn't using proper terminology. The register files per SM are very large and hold the registers of the warps assigned to it. A register file has a fixed size, such as 32KB (for compute capability 1.0 and 1.1), so although the max number of warps that may be assigned to an SM is specified as 24, as the linked guide says, if those warps use more than 10 registers then fewer than 24 can fit in that 32KB register file. I was talking about how at one moment in time only N threads can actually get ... – Steven Lu May 12 '20 at 22:46
  • ... executed, as there are only N = 112 FP32 execution units (CUDA cores) on the entire 10-year-old GPU! I'm trying to point out that you are using these very specific numbers, such as the max threads that can be assigned to an SM, as meaning that that many warps are concurrently running, which is not the case, although SOME way of thinking about it exists such that you could twist those words to mean that. Indeed, all 24 warps that might get assigned to a given SM will gradually get processed until they finish before they can be evicted. – Steven Lu May 12 '20 at 22:49
  • @StevenLu thanks for the explanation, I see your point now; I am editing my answer to address that. However, I think the actual number of concurrent threads will be 1120: according to [this article](https://streamhpc.com/blog/2017-01-24/many-threads-can-run-gpu/), the maximum number of active/running threads is equal to the number of cores on a GPU times 10. – Bizhan May 13 '20 at 07:50
  • Yeah the notion of a “running” or “active” thread is fraught with ambiguity for GPU architecture... that article is a pretty good one it seems. But I’ve only skimmed it, not sure where you’re getting the factor of 10 from. On newer arch’s, up to 64 warps worth of state can be placed into the register file of one of the processing subunits of a SM... all those 64*32 threads can be said to be “running”, or less controversially to me said to be “active”, but the processing subunit only will have 32 or 48 cuda cores. – Steven Lu May 13 '20 at 11:32
  • @StevenLu it seemed weird to me too, so I left a comment for the author. To find where I got the idea, just search for this text in the article: *The maximum number of active/running threads is equal to the number of cores on a GPU times 10.* – Bizhan May 13 '20 at 11:51
  • Yes, I think the article is saying that because, a few paragraphs above: "The number of active threads per core on AMD hardware is 4 to up to 10, depending on the kernel code (key word: occupancy). This means that with our example of 1000 cores, there are up to 10000 active threads." I just don't find that streamhpc article to be particularly great... – Steven Lu May 13 '20 at 13:37
  • Generally the idea with GPU programming is that the way you set up your threads in a block ("Workgroup" in OpenCL) is vitally important, as that determines the mapping to the use of the shared memory (which is balls-to-the-walls fast compared to main memory). It's a bit too much to explain in comments – Steven Lu May 13 '20 at 13:38
  • @StevenLu point taken, I didn't read it carefully enough to notice it's talking about a particular arch. Anyway I think it's correct to say the **maximum number of threads** equals **cores times warpsize**. I assume a CUDA programmer is more interested in how the code works rather than how the hardware works behind the scenes. – Bizhan May 13 '20 at 15:09
  • @StevenLu back to the first issue, `__syncthreads` synchronizes all threads in a block, but since a block cannot contain more threads than this limit, it needs to re-execute the kernel to handle more. I believe this was the result of my experiment a decade ago. – Bizhan May 13 '20 at 15:27
  • Yes, the StreamHPC article you linked did state that the maximum number of threads is probably at least 4 billion, so I'd go with that. There should never be an upper limit on the number of threads. Sorry that I wasn't clear in my earlier feedback; that's the main issue I have with your answer: you're using architecture-specific values to make recommendations on the number of threads, which seems pretty off-base. Roughly speaking, one only needs to consider the max number of threads per block (which is 1024 for all recent compute capabilities) in determining block sizing, and the more blocks of work, the merrier. – Steven Lu May 13 '20 at 16:39
  • The other thing relating to block sizing is the consumption of registers in the kernel. If it is high, it will push down the number of threads that can fit into a block (which has to stay resident within a single SM, otherwise its threads cannot share the use of that SM's shared memory). – Steven Lu May 13 '20 at 16:40
  • @StevenLu the maximum number of threads is not the issue here; `__syncthreads` is a block-wide operation, and the fact that it does not actually synchronize all threads is a nuisance for CUDA learners. So I updated my answer based on the information you gave me. I really appreciate it. – Bizhan May 13 '20 at 17:49
  • Thanks! Have an upvote :) the answer definitely reads way better now, at least based on what I think I know. Haha. Cheers! – Steven Lu May 13 '20 at 18:34