
I just need to clarify something very basic. Most of the computational examples use something like:

ID = blockIdx.x*blockDim.x+threadIdx.x;

// ... then do computation on array[ID]

My question is: if I want to use the maximum number of threads in a block (1024), do I really need to 'construct' my 'threadID' taking all of threadIdx.x, threadIdx.y, and threadIdx.z into account?

If so, what is a recommended way to hash them into a single value?

If not, why does someone use it in a similar fashion in image-processing-related operations, such as in this post:

https://stackoverflow.com/questions/11503406/cuda-addressing-a-matrix

And what about blockIdx.x and blockIdx.y, are they in the same situation as threadIdx in this regard?

Stochastika

4 Answers


Creating 2D or 3D threadblocks is usually done because the problem lends itself to a 2D or 3D interpretation of the data, and handling it using a 2D or 3D threadblock may make the code more readable. But there's no specific reason why it cannot be done with a 1D threadblock with appropriate indexing.

Creating a 2D or 3D grid (of blocks) is usually done for the reason described above and/or to get around the limitation on pre CC 3.0 devices of the number of blocks in any one dimension of a grid (65535 max blocks in any dimension).

For the threadblock case, you can use 1024 threads in a single block in a single dimension, so you don't need to construct your ID variable with threadIdx.y or threadIdx.z if you don't want to.
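For completeness, if you do choose a 2D or 3D threadblock, the usual way to flatten the thread coordinates into one linear ID (the "hash" asked about in the question) is sketched below; the kernel name and the use of the index are just illustrative:

    __global__ void flatten_example(float *data)
    {
        // With a 1D block of up to 1024 threads, threadIdx.x alone is enough:
        //   int tid_in_block = threadIdx.x;
        // With a 2D or 3D block, flatten x, y, z into one linear ID
        // (x varies fastest, which matches how threads are grouped into warps):
        int tid_in_block = threadIdx.x
                         + threadIdx.y * blockDim.x
                         + threadIdx.z * blockDim.x * blockDim.y;

        // Combine with a 1D block index for a globally unique index:
        int threads_per_block = blockDim.x * blockDim.y * blockDim.z;
        int idx = blockIdx.x * threads_per_block + tid_in_block;

        data[idx] = (float)idx;   // illustrative use of the index
    }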

If you have a pre CC 3.0 device, and your problem is large enough in terms of blocks, you may still want to construct a 2D grid. You can still use 1D threadblocks in that grid. In that case, a unique ID variable can be created like:

 int idx = threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x);  

The above construct should handle 1D threadblocks with any 2D grid.
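To make that concrete, here is a sketch of a kernel and launch that uses this formula with 1D threadblocks in a 2D grid; the array size, block size, and kernel name are arbitrary assumptions:

    #include <cuda_runtime.h>

    __global__ void process(float *array, size_t n)
    {
        // 1D threadblock in a 2D grid: same formula as above, plus a bounds check
        size_t idx = threadIdx.x
                   + (((size_t)gridDim.x * blockIdx.y + blockIdx.x) * blockDim.x);
        if (idx < n)
            array[idx] = 2.0f * array[idx];
    }

    int main()
    {
        const size_t n = 20000000;   // needs > 65535 blocks of 256 threads in 1D
        const int threads = 256;
        size_t total_blocks = (n + threads - 1) / threads;

        // Split the blocks across x and y so no grid dimension exceeds 65535
        dim3 grid(65535, (unsigned int)((total_blocks + 65534) / 65535));

        float *d_array;
        cudaMalloc(&d_array, n * sizeof(float));
        process<<<grid, threads>>>(d_array, n);
        cudaDeviceSynchronize();
        cudaFree(d_array);
        return 0;
    }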

There are other methods besides constructing a 2D grid to work with large problem sizes, such as having your blocks handle multiple chunks of data in a loop of some sort.
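One common form of that approach is a grid-stride loop; a minimal sketch (block and grid sizes chosen arbitrarily) is:

    __global__ void process_strided(float *array, size_t n)
    {
        // Each thread starts at its global index and then strides by the total
        // number of threads in the grid, so a modest 1D grid covers any n.
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
             idx < n;
             idx += stride)
        {
            array[idx] = 2.0f * array[idx];
        }
    }

    // Launched with a fixed grid size, e.g.:
    //   process_strided<<<1024, 256>>>(d_array, n);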

Robert Crovella

This is on top of Robert Crovella's answer:

Creating a 2D/3D grid is not just for readability, but also for exploiting 2D/3D locality in the on-chip shared memory, which provides much faster accesses. You cannot exploit such locality efficiently with a 1D grid if your problem naturally maps onto a 2D grid.
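For instance, a 2D threadblock lets neighbouring threads cooperatively stage a 2D tile of the data in shared memory. The sketch below is only illustrative (TILE, the kernel name, and the pass-through body are assumptions):

    #define TILE 16

    __global__ void tile_example(const float *in, float *out, int width, int height)
    {
        // Each 16x16 threadblock stages a 16x16 tile of the image in shared
        // memory, so neighbouring threads reuse each other's loads.
        __shared__ float tile[TILE][TILE];

        int x = blockIdx.x * TILE + threadIdx.x;   // column in the image
        int y = blockIdx.y * TILE + threadIdx.y;   // row in the image

        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        // ... operate on tile[][] using its 2D neighbourhood here ...

        if (x < width && y < height)
            out[y * width + x] = tile[threadIdx.y][threadIdx.x];
    }

    // Typical launch: dim3 block(TILE, TILE);
    //                 dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    //                 tile_example<<<grid, block>>>(d_in, d_out, width, height);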

user2030440

threadID is a misleading term in your example. The value calculated is actually an index into an array that the current thread will read or write. If your kernel is invoked with multiple blocks, you need to calculate the index in this way to process every array element once.
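Put differently, the computed value is a per-element array index; with a bounds check, every element gets processed exactly once no matter how many blocks you launch (the kernel name and sizes below are just for illustration):

    __global__ void scale(float *array, float factor, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // index into array, not a hardware thread ID
        if (idx < n)               // surplus threads in the last block do nothing
            array[idx] *= factor;  // each element is processed exactly once
    }

    // Launched with enough blocks to cover the whole array:
    //   scale<<<(n + 255) / 256, 256>>>(d_array, 2.0f, n);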

shoelzer

Remember that the way you hash threadIdx.x, threadIdx.y, blockIdx.x, and blockIdx.y into a single value affects the coalescing of the global memory accesses you are performing; see harrism's answer in this thread:

CUDA coalesced access to global memory
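As an illustration of that point, the two sketch kernels below copy the same row-major matrix and differ only in which thread coordinate is mapped to the column index: in the first, consecutive threadIdx.x values touch consecutive addresses (coalesced); in the second they are a full row apart (strided). Names and dimensions are assumptions:

    // Coalesced: consecutive threadIdx.x values access consecutive addresses
    __global__ void copy_coalesced(const float *in, float *out, int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index -> columns
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            out[row * width + col] = in[row * width + col];
    }

    // Strided: consecutive threadIdx.x values access addresses 'width' floats apart
    __global__ void copy_strided(const float *in, float *out, int width, int height)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index -> rows
        int col = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            out[row * width + col] = in[row * width + col];
    }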

Vitality