3

I have known the ideas of block and grid in cuda, and I'm wondering if there is any helper function well written that can help me determine the best block and grid size for any given 2D image.

For example, for a 512x512 image mentioned in this thread. Grid is 64x64 and block is 8x8.

However sometimes my input image may not be power of 2, it may be 317x217 or something like that.In this case, maybe grid should be 317x1 and block should be 1x217.

So if I have an application that accepts an image from user, and use cuda to process it, how can it automatically determine the size and dimension of block and grid, where user can input any size of image.

Is there any existed helper function or class that handles this problem?

Community
  • 1
  • 1
Philip Zhang
  • 627
  • 8
  • 17

2 Answers2

6

Usually you want to choose the size of your blocks based on your GPU architecture, with the goal of maintaining 100% occupancy on the Streaming Multiprocessor (SM). For example, the GPUs at my school can run 1536 threads per SM, and up to 8 blocks per SM, but each block can only have up to 1024 threads in each dimension. So if I were to launch a 1d kernel on the GPU, I could max out a block with 1024 threads, but then only 1 block would be on the SM (66% occupancy). If I instead chose a smaller number, like 192 threads or 256 threads per block, then I could have 100% occupancy with 6 and 8 blocks respectively on the SM.

Another thing to consider is the amount of memory that must be accessed vs the amount of computation to be done. In many imaging applications, you don't just need the value at a single pixel, rather you need the surrounding pixels as well. Cuda groups its threads into warps, which step through every instruction simultaneously (currently, there are 32 threads to a warp, though that may change). Making your blocks square generally minimizes the amount of memory that needs to be loaded vs the amount of computation that can be done, making the GPU more efficient. Likewise, blocks that are a power of 2 load memory more efficiently (if properly aligned with memory addresses) since Cuda loads memory lines at a time instead of by single values.

So for your example, even though it might seem more effective to have a grid that is 317x1 and blocks that are 1x217, your code will likely be more efficient if you launch blocks that are 16x16 on a grid that is 20x14 as it will lead to better computation/memory ratio and SM occupancy. This does mean, though, that you will have to check within the kernel to make sure the thread is not out of the picture before trying to access memory, something like

const int thread_id_x = blockIdx.x*blockDim.x+threadIdx.x;
const int thread_id_y = blockIdx.y*blockDim.y+threadIdx.y;
if(thread_id_x < pic_width && thread_id_y < pic_height)
{
  //Do stuff
}

Lastly, you can determine the lowest number of blocks you need in each grid dimension that completely covers your image with (N+M-1)/M where N is the number of total threads in that dimension and you have M threads per block in that dimension.

user3288829
  • 1,266
  • 15
  • 26
1

It depends on how you deal with the image. If your thread only process each pixel separately, for example, adding 3 to each pixel value, you can just assign one dimension to your block size and the other to your grid size (just do not out of range). But if you want to do something like filter or erode, this kind of operation often need to access the pixels near the center pixel like 3*3 of 9*9. Then the block should be 8*8 as you mentioned, or some other value. And you'd better to use the texture memory. Because when the thread access to the global memory, there always be 32 thread to be a wrap in a block one time.

So there isn't function as you described. The number of threads and blocks depends on how you process the data, it is not universal.

Kill Console
  • 1,933
  • 18
  • 15