
I took a course in CUDA parallel programming, and I have seen many examples of thread configuration where the number of threads needed is rounded up to the nearest multiple of 32. I understand that threads are grouped into warps, and that if you launch 1000 threads the GPU will round the allocation up to 1024 anyway, so why do it explicitly?

talonmies
Michael

1 Answer


The advice is generally given in the context of situations where you might conceivably choose various threadblock sizes to solve the same problem.

Let's take vector add as an example. Suppose my vector is of length 100000. I might choose to do this by launching 100 blocks of 1000 threads each. Since threads are issued in warps of 32, a 1000-thread block occupies 32 warps (1024 thread slots), so each block has 1000 active threads and 24 inactive ones. My average utilization of thread resources is 1000/1024 ≈ 97.7%.

Now, what if I chose blocks of size 1024? I would only need to launch 98 blocks. The first 97 of these blocks are fully utilized - every thread is doing something useful. The 98th block only has 672 (out of 1024) threads doing useful work; the rest are explicitly idled by a thread check (such as if (idx < N)) or a similar construct in the kernel code. So I have 352 inactive threads in that one block, but my overall average utilization is 100000/100352 ≈ 99.6%.

So in the above scenario, it's better to choose "full" threadblocks whose size is evenly divisible by 32.
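For concreteness, here is a minimal sketch of the 1024-thread-per-block configuration described above. The kernel name, variable names, and the managed-memory setup are illustrative choices, not taken from the question:

```cuda
// Vector add with "full" 1024-thread blocks and a thread check in the kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                     // thread check: the extra 352 threads in
        c[idx] = a[idx] + b[idx];    // the last block simply do nothing
}

int main()
{
    const int N = 100000;
    float *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(float));
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&c, N * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int blockSize = 1024;                             // multiple of 32
    const int gridSize  = (N + blockSize - 1) / blockSize;  // ceiling division: 98 blocks
    vectorAdd<<<gridSize, blockSize>>>(a, b, c, N);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The ceiling division (N + blockSize - 1) / blockSize is the usual idiom for rounding the grid size up so that every element is covered, which is exactly why the kernel needs the idx < n check.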

If you are doing vector add on a vector of length 1000, and you intend to do it in a single threadblock (both may be bad ideas), then it does not matter whether you choose 1000 or 1024 for your threadblock size: either way the block occupies 32 warps, so the same number of thread slots is consumed.

Robert Crovella