1

Let's suppose I have a kernel call with a 2D grid, like so:

dim3 dimGrid(x, y); // not important what the actual values are
dim3 dimBlock(blockSize, blockSize);
myKernel <<< dimGrid, dimBlock >>>();

Now I've read that multidimensional grids are merely meant to ease programming - the underlying hardware will only ever use 1D linearly cached memory (unless you use texture memory, but that's not relevant here).

My question is: In what order will the threads be assigned to the grid indices during warp scheduling? Will they be assigned horizontally ("iterate" x, then y) or vertically ("iterate" y, then x)? This might be relevant to improve memory coalescing, depending on how I access my memory in the kernel.

To make it more clear, let's say the following represents the thread's IDs as applied to my (imaginary) grid with a "horizontal" distribution:

[ 0  1  2  3 ]
[ 4  5  6  7 ]
[ 8  9 10 11 ]
[ ...        ]

And "vertical" distribution would be:

[ 0  4  8 .. ]
[ 1  5  9 .. ]
[ 2  6 10 .. ]
[ 3  7 11 .. ]

I hope you can see how this might affect coalescing: With each variant, there will be a specific optimal way to access my device memory buffer.

Unfortunately, I have not found any detailed information on this yet..

lethal-guitar
  • 4,438
  • 1
  • 20
  • 40
  • 1
    Read CUDA C Programming Guide Section 2.2, Thread Hierarchy. – sgarizvi Jan 10 '13 at 12:24
  • possible duplicate of [How are threads divided into warps? Cuda](http://stackoverflow.com/questions/6177202/how-are-threads-divided-into-warps-cuda) – talonmies Jan 10 '13 at 12:35

1 Answers1

3

Horizontal and vertical is arbitrary. But threads do have a well-defined x, y, and z dimension. Threads are grouped into warps in the order of x, y, z. So a 16x16 threadblock will have threads in the following order in the first 32-thread warp:

warp lane: thread ID (x,y,z)

  • 0: 0,0,0
  • 1: 1,0,0
  • 2: 2,0,0
  • 3: 3,0,0
  • ...
  • 15: 15,0,0
  • 16: 0,1,0
  • 17: 1,1,0
  • 18: 2,1,0
  • 19: 3,1,0
  • ...
  • 31: 15,1,0

The above is also the exact pattern you would have for a threadblock dimension of 16,2 which would fill one warp.

For a dimension that does not fill a warp completely, such as 8,3 (or 8,29, which will leave the final warp only partially full of active threads), the assignment order follows the same pattern. The rapidly varying dimension as you assign threads to warps is the x dimension, just as you see above. The next most rapidly varying dimension is y, then z.

The programming guide also explains how to number threads in order. When threads are numbered in that order, the first 32 threads belong to the first warp, the next 32 threads belong to the next warp, etc.

If there is only a partial complement (i.e. less than 32) of threads available for a particular last warp in the above numbering order, then the last warp (only) will consist of fewer than 32 active threads.

Robert Crovella
  • 143,785
  • 11
  • 213
  • 257