
I have a question regarding the dimensions of blocks and grids in (py)CUDA. I know that there are limits on the total size of a block, but not on the grid.

I also know that the actual block size influences the runtime. What I'm wondering is: if I have a block of 256 threads, does it make a difference whether I launch it as (256,1), as (128,2), as (64,4), etc.?

If it makes a difference: which is the fastest?

user2267896
  • Actually I read that thread too, but it doesn't answer my question, as the discussion is only about the total size of the block and how to determine it in the ideal case. My question is: does it make a (performance) difference if I take a block size of 256 and invoke it in a (256,1,1), a (128,2,1), or a (64,2,2) fashion, given an adapted implementation of course? – user2267896 Aug 23 '13 at 07:16

1 Answer


Yes, it makes a difference.

(256,1) creates a (1D) block of 256 threads in the x-dimension, all of which have a y-index of 0.

(128,2) creates a (2D) block of 128x2 threads, i.e. 128 in the x-dimension and 2 in the y-dimension. These threads will have an x-index ranging from 0 to 127 and a y-index ranging from 0 to 1.
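
Either way, you can recover a flat in-block thread number from the built-in variables. As a sketch of the usual linearization (tid is just an illustrative name):

int tid = threadIdx.y * blockDim.x + threadIdx.x;   // flat in-block thread number
// with a (256,1) block: tid = 0*256 + threadIdx.x           -> 0..255
// with a (128,2) block: tid = threadIdx.y*128 + threadIdx.x -> 0..255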

The structure of your kernel code must take this thread indexing/numbering into account.

For example if your kernel code starts with something like:

int idx=threadIdx.x+blockDim.x*blockIdx.x;

and doesn't create any other index variables, it's probably assuming a 1D threadblock and 1D grid.
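
For instance, a minimal sketch of a kernel built around that 1D pattern might look like this (the kernel and variable names are made up for illustration):

// minimal 1D-indexing sketch; scale1d, data and n are illustrative names
__global__ void scale1d(float *data, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;   // global 1D thread index
    if (idx < n)                                       // guard the last, partially-filled block
        data[idx] *= 2.0f;
}

// host side, launched with 256-thread (256,1,1) blocks:
// scale1d<<<(n + 255) / 256, 256>>>(d_data, n);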

If, on the other hand, your kernel code starts with something like:

int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;

It's probably expecting a 2D grid and 2D threadblocks.
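
A corresponding 2D sketch, again with illustrative names, could be:

// minimal 2D-indexing sketch; scale2d, width and height are illustrative names
__global__ void scale2d(float *data, int width, int height)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;   // column
    int idy = threadIdx.y + blockDim.y * blockIdx.y;   // row
    if (idx < width && idy < height)
        data[idy * width + idx] *= 2.0f;               // row-major linearization
}

// host side, launched with 256-thread (128,2,1) blocks:
// dim3 block(128, 2);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// scale2d<<<grid, block>>>(d_data, width, height);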

Generally speaking, the two approaches are not interchangeable, meaning you cannot launch a kernel that expects a 1D grid with a 2D grid and expect everything to work normally, and vice-versa.

Robert Crovella
  • That is clear. My question was more: given the implementation is adapted, is there a performance difference? Or is it just for the programmer's convenience, to be able to use a more natural way of indexing matrices, volumes, etc.? – user2267896 Aug 23 '13 at 07:13
  • There is no performance difference *due to thread dimensions* if the total thread count is the same. The machine creates the same number of warps. The only difference will be the specific built-in variables (e.g. threadIdx.x, etc.) assigned to each thread. – Robert Crovella Aug 23 '13 at 08:21
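
To illustrate the "adapted implementation" mentioned in the comments, one common way to make a kernel indifferent to the block shape is to linearize the in-block index first. A sketch, with scaleAny as an illustrative name:

// shape-agnostic indexing sketch: works for (256,1,1), (128,2,1), (64,4,1), ...
__global__ void scaleAny(float *data, int n)
{
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int tid = threadIdx.z * blockDim.y * blockDim.x
            + threadIdx.y * blockDim.x
            + threadIdx.x;                              // 0..threadsPerBlock-1
    int idx = blockIdx.x * threadsPerBlock + tid;       // global index, assuming a 1D grid
    if (idx < n)
        data[idx] *= 2.0f;
}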