How to properly add in global memory in CUDA?

Question

I'm trying to implement sum of absolute differences (SAD) in CUDA for a homework assignment, but am having trouble getting correct results. I am given a Blocksize that represents the X and Y size (in pixels) of the square portion of the images to compare. I am also given two images in YUV format. Below are the portions of the program I have to implement: the kernel that calculates the SAD and the setup for the size of the grid/blocks of threads. The rest of the program is provided, and can be assumed to be correct.

Here I'm getting the x and y index of the current thread and using those to get the pixel in the image arrays I'm dealing with in the current thread. Then I calculate the absolute difference and wait for all the threads to finish calculating that. Finally, if the current thread is within the block of the image we care about, the absolute difference is added to the sum in global memory with an atomicAdd to avoid a collision during the write.

__global__ void gpuCounterKernel(pixel* cuda_curBlock, pixel* cuda_refBlock, uint32* cuda_SAD, uint32 cuda_Blocksize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int id = idx * cuda_Blocksize + idy;
    int AD = abs( cuda_curBlock[id] - cuda_refBlock[id] );
    __syncthreads();
    if( idx < cuda_Blocksize && idy < cuda_Blocksize ) {
        atomicAdd( cuda_SAD, AD );
    }
}

And this is how I'm setting up the grid and blocks for the kernel:

int grid_sizeX   = Blocksize/2;
int grid_sizeY   = Blocksize/2;
int block_sizeX  = Blocksize/4;
int block_sizeY  = Blocksize/4;
dim3 blocksInGrid(grid_sizeX, grid_sizeY);
dim3 threadsInBlock(block_sizeX, block_sizeY);

The given program calculates the SAD on the CPU as well and compares our result from the GPU with that one to check for correctness. Valid block sizes within the image are from 1-1000. My solution above is getting correct results from 10-91, but anything above 91 just returns 0 for the sum. What am I doing wrong?

What's the difference between 'Blocksize' and 'block_size'? The naming looks confusing. Is one the size of CUDA's thread block and the other the size of the image ROI? – kangshiyin, Sep 25 '13 at 05:57
Ah yeah, sorry. Those are the names provided in the skeleton implementation by our professor. Blocksize refers to the X/Y size of the square box of pixels we are summing in the input image, so Blocksize*Blocksize is the size of the box we are calculating for. block_size refers to the size of the blocks of CUDA threads. – PseudoPsyche, Sep 25 '13 at 06:01
Perhaps you should give the full code. I also think you have to check your block and thread configuration. For example, for an image of 100x100 with Blocksize = 100, grid_sizeX = grid_sizeY = 50 and block_sizeX = block_sizeY = 25, which gives a total of 25*25*50*50 = 1562500 threads (not equal to 100x100). – Sagar Masuti, Sep 25 '13 at 06:09

score 1 · Answer 1 · edited May 23 '17 at 12:31

You really should show all the code, or at least the portion that calls the kernel and allocates data for GPU use, and identify the GPU you are running on.

Are you doing proper CUDA error checking on all CUDA API calls and kernel calls?
Probably your kernel is not running at all because your threadsInBlock parameter is exceeding 512 threads total. You indicate that at Blocksize = 92 and above, things are not working. Let's do the math:
```
92/4 = 23 threads in X and Y dimensions
23 * 23 = 529 total threads requested per threadblock
```

529 exceeds 512, which is the per-block thread limit for compute capability (cc) 1.x devices. So I'm guessing you're running on a cc 1.x device: your kernel launch fails, the kernel never runs, and you get no computed result (i.e. 0). Note that at Blocksize = 91, 91/4 = 22 threads in X and Y dimensions, so you are requesting 484 total threads, which does not exceed the 512 limit for cc 1.x devices.
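The same arithmetic can be checked against whatever limit the device actually reports. This is a minimal sketch, assuming the question's Blocksize/4 block shape. The helper names `requestedThreadsPerBlock` and `fitsDeviceLimit` are made up for illustration; `cudaGetDeviceProperties` and the `maxThreadsPerBlock` field come from the CUDA runtime.

```cuda
#include <cuda_runtime.h>

// Threads per block requested by the question's setup:
// block_sizeX = block_sizeY = Blocksize / 4 (integer division).
inline int requestedThreadsPerBlock(int Blocksize) {
    int side = Blocksize / 4;
    return side * side;
}

// Hypothetical helper: returns true if the request fits the per-block
// thread limit reported by the given device. Returns false if it does
// not fit or if the device query itself fails.
inline bool fitsDeviceLimit(int Blocksize, int device = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return requestedThreadsPerBlock(Blocksize) <= prop.maxThreadsPerBlock;
}
```

With this, `requestedThreadsPerBlock(91)` is 484 and `requestedThreadsPerBlock(92)` is 529, matching the numbers above. Whether 529 is accepted depends on the device, which is why the limit is queried rather than hard-coded.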

If you were doing proper CUDA error checking, the error report would have focused your attention on the CUDA kernel launch failing due to incorrect launch parameters.
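One common way to do this kind of checking is sketched below. The `CUDA_CHECK` macro, `emptyKernel`, and `tryLaunch` are illustrative names and not part of the CUDA toolkit; `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are runtime API calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a readable message if a
// CUDA runtime API call returns an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel used only to exercise the launch path.
__global__ void emptyKernel() {}

// Launches the placeholder kernel with the given block shape and returns
// the resulting status instead of aborting, so the caller can inspect it.
inline cudaError_t tryLaunch(dim3 threadsInBlock) {
    emptyKernel<<<dim3(1, 1), threadsInBlock>>>();
    cudaError_t err = cudaGetLastError();   // catches invalid launch configurations
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();      // catches errors raised while the kernel ran
    }
    return err;
}
```

In real code the macro would wrap each API call, for example `CUDA_CHECK(cudaMemcpy(...))`, and a kernel launch would be followed by `CUDA_CHECK(cudaGetLastError())`. A call such as `tryLaunch(dim3(64, 64))` requests 4096 threads per block and returns an error rather than silently doing nothing.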

Yes, this was the case. I modified the size of the blocks and grid and now it runs fine for all input. Thank you. I will definitely look into the error checking more. There is none elsewhere in the program because we were instructed not to modify any other portion of the code that was given to us, and my professor did not include error checking in this code. – PseudoPsyche, Sep 25 '13 at 06:25

kangshiyin · Accepted Answer · 2013-09-25T06:54:37.597

Your grid and block size settings look odd.

For image pixels, we usually use settings similar to the following.

int imageROISize=1000;
dim3 threadInBlock(16,16);
dim3 blocksInGrid((imageROISize+15)/16, (imageROISize+15)/16);

You could refer to the following section of the CUDA C Programming Guide for more information on how to distribute workloads to CUDA threads.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
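To show how a fixed 16x16 block and a rounded-up grid fit together with the SAD kernel, here is a hedged sketch rather than the asker's final code. The `pixel` and `uint32` typedefs are assumed stand-ins for the skeleton's types, which the question does not show, and `blocksFor`, `sadKernel`, and `launchSad` are illustrative names.

```cuda
#include <cuda_runtime.h>

// Assumed stand-ins for the skeleton's types (not shown in the question).
typedef unsigned char pixel;
typedef unsigned int  uint32;

// Ceiling division: number of tiles of width `tile` needed to cover n items.
inline unsigned int blocksFor(unsigned int n, unsigned int tile) {
    return (n + tile - 1) / tile;
}

__global__ void sadKernel(const pixel* cur, const pixel* ref,
                          uint32* sad, uint32 roiSize) {
    uint32 x = blockIdx.x * blockDim.x + threadIdx.x;
    uint32 y = blockIdx.y * blockDim.y + threadIdx.y;
    // The grid is rounded up, so threads past the ROI edge must do nothing,
    // including not reading from the input arrays.
    if (x < roiSize && y < roiSize) {
        uint32 id = y * roiSize + x;   // assumes the ROI is stored contiguously
        int diff = (int)cur[id] - (int)ref[id];
        atomicAdd(sad, (uint32)(diff < 0 ? -diff : diff));
    }
}

// Host-side launch. All pointers are assumed to be device pointers, and
// *sad is assumed to have been zeroed (e.g. with cudaMemset) beforehand.
inline cudaError_t launchSad(const pixel* cur, const pixel* ref,
                             uint32* sad, uint32 roiSize) {
    const unsigned int tile = 16;
    dim3 threadsInBlock(tile, tile);   // 256 threads per block
    unsigned int blocks = blocksFor(roiSize, tile);
    dim3 blocksInGrid(blocks, blocks);
    sadKernel<<<blocksInGrid, threadsInBlock>>>(cur, ref, sad, roiSize);
    return cudaGetLastError();
}
```

A 16x16 block is 256 threads, which is a multiple of the 32-thread warp size and stays under the 512-thread limit mentioned for cc 1.x devices. Because the block shape no longer depends on the ROI size, only the grid grows with the input: for a ROI of 1000, `blocksFor(1000, 16)` gives 63 blocks per dimension, and the bounds check discards the 8 surplus threads at each edge.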

edited Sep 25 '13 at 06:54

answered Sep 25 '13 at 06:06

Okay, I changed to this and it works fine for all inputs of different ROI now. Could you maybe explain the logic behind the dimensions you chose here? I'm still trying to learn how to select good dimensions for setting this up. Thank you! – PseudoPsyche Sep 25 '13 at 06:26
@PseudoPsyche that will be a long story. I added a link in my answer for you. – kangshiyin Sep 25 '13 at 06:56
