CUDA kernel not updating the output data

Question

The task is to calculate the average of multiple images. I have it working the normal way (without CUDA), so I thought I would give it a go using CUDA, but the output I get is the first image instead of the average. (Inside the kernel I also tried setting some pixels to 0 to make sure something is happening, but no luck.)

// My kernel:
// nImages - number of images in memory
// nBytes - number of pixels * color channels per image (also the size of dataOut)
// nImages*nBytes gives us the size of dataIn
// nBatch - dataIn has 1 million bytes per image and we run 6144 threads, so we need 163 batches to calculate the whole dataOut
__global__ 
void avg_arrays(unsigned char* cuDataIn, unsigned char* cuDataOut, int nImages, int nBytes, int nBatch) 
{
   //get the position of the correct byte
   int j = threadIdx.x +  nBatch;
   //if we're outside of image then give up
   if(j >= nBytes) return;
   //proceed averaging
   long lSum = 0;
   for(int i=0; i < nImages; ++i) 
      lSum += cuDataIn[i*nBytes + j];
   lSum = lSum / nImages;
   cuDataOut[j] = lSum;
}

Memory allocation etc.

unsigned char* dataIn = 0;
unsigned char* dataOut= 0;

// Allocate and transfer memory to the device
gpuErrchk( cudaMalloc((void**)&dataIn, nPixelCountBGR * nNumberOfImages * sizeof(unsigned char)));  // dataIn
gpuErrchk( cudaMalloc((void**)&dataOut, nPixelCountBGR * sizeof(unsigned char)));  // dataOut
gpuErrchk( cudaMemcpy(dataIn, bmps, nPixelCountBGR * nNumberOfImages * sizeof(unsigned char), cudaMemcpyHostToDevice ));  // dataIn
gpuErrchk( cudaMemcpy(dataOut, basePixels, nPixelCountBGR * sizeof(unsigned char), cudaMemcpyHostToDevice ));  // dataOut

// Perform the array addition
dim3 dimBlock(N);  
dim3 dimGrid(1);

// Do it in batches, unless it's possible to run more threads at once; N is the maximum number of threads
for(int i=0; i<nPixelCountBGR; i+=N){
   cout << "Running with: nImg: "<< nNumberOfImages << ", nPixBGR " << nPixelCountBGR << ", and i = " << i << endl;
   avg_arrays<<<dimGrid, dimBlock>>>(dataIn, dataOut, nNumberOfImages, nPixelCountBGR, 0);
}
// Copy the Contents from the GPU
gpuErrchk(cudaMemcpy(basePixels, dataOut, nPixelCountBGR * sizeof(unsigned char), cudaMemcpyDeviceToHost)); 

gpuErrchk(cudaFree(dataOut));
gpuErrchk(cudaFree(dataIn));

The error checks don't report anything and all the code runs smoothly, but all I get at the end is an exact copy of the first image.

In case someone needs it, here is some console output:

Running with: nImg: 29, nPixBGR 1228800, and i = 0
...
Running with: nImg: 29, nPixBGR 1228800, and i = 1210368
Running with: nImg: 29, nPixBGR 1228800, and i = 1216512
Running with: nImg: 29, nPixBGR 1228800, and i = 1222656
Time of averaging: 0.219

The basic problem is that you are averaging the `i=0` part over and over again. – Jeffrey Sax Nov 12 '13 at 16:46
Jeffrey, can you say something more? Each time I start averaging a different batch of 6144 pixels and go through all the images stored in cuDataIn (the first image starts at 0, the second at 1228800, and so on). Inside the kernel I attempt to go through all of the images I currently keep in memory (29). – Pete Kozak Nov 12 '13 at 16:50
What is `N`? You're not doing error checking correctly on the kernel call. If N is greater than 512 or 1024, your kernel is not running. – Robert Crovella Nov 12 '13 at 16:52
Jeffrey, I've spotted that silly 0 at the end ("..., nPixelCountBGR, 0);") and updated it to "i", but still no luck :/ Robert, N was set to 6144 in the following way: #define N 6144. I've tried changing it to a number lower than 512 and it still has no effect; I receive the first image on the output. – Pete Kozak Nov 12 '13 at 16:56
@PeteKozak You may want to learn more about blocks and grids. You're cutting up your data into chunks manually here, and calling a kernel for each of your chunks, but you can have it all done automatically in one kernel launch. Udacity's course is great, and free: https://www.udacity.com/course/cs344. – Jeffrey Sax Nov 12 '13 at 19:24
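
To make the single-launch suggestion in the comments concrete, here is a minimal sketch, not the asker's final code. The names `avg_arrays_grid` and `launch_avg` are hypothetical, and 256 threads per block is simply the value the asker later reported as working. Each thread derives its byte index from `blockIdx.x`, `blockDim.x` and `threadIdx.x`, so the host-side batching loop and the `nBatch` argument are no longer needed.

```cuda
#include <cuda_runtime.h>

// One thread per output byte; the grid as a whole covers the image.
__global__ void avg_arrays_grid(const unsigned char* cuDataIn,
                                unsigned char* cuDataOut,
                                int nImages, int nBytes)
{
    // Global byte index built from the block and thread coordinates.
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    // The last block may extend past the end of the image.
    if (j >= nBytes) return;

    unsigned int sum = 0;
    for (int i = 0; i < nImages; ++i)
        sum += cuDataIn[(size_t)i * nBytes + j];
    cuDataOut[j] = (unsigned char)(sum / nImages);
}

// Hypothetical host wrapper: a single launch instead of a loop of launches.
// Both pointers must refer to device memory.
cudaError_t launch_avg(const unsigned char* dIn, unsigned char* dOut,
                       int nImages, int nBytes)
{
    const int threadsPerBlock = 256;
    // Round up so that blocks * threadsPerBlock >= nBytes.
    const int blocks = (nBytes + threadsPerBlock - 1) / threadsPerBlock;

    avg_arrays_grid<<<blocks, threadsPerBlock>>>(dIn, dOut, nImages, nBytes);

    cudaError_t err = cudaGetLastError();   // launch-time errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // errors raised while running
}
```

With the question's numbers (nBytes = 1228800) this launches 4800 blocks of 256 threads, replacing the 200 separate launches of the batching loop.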

score 1 · Accepted Answer · edited May 23 '17 at 12:19

If N is greater than 512 or 1024 (depending on which GPU you are running on, which you don't mention), then this is invalid:

dim3 dimBlock(N);

because you can't launch a kernel with more than 512 threads per block (compute capability 1.x devices) or 1024 threads per block (compute capability 2.0 and later):

 avg_arrays<<<dimGrid, dimBlock>>>(...
                          ^
                          |
                     this is limited to 512 or 1024

If you study proper CUDA error checking and apply it to your kernel launch, you'll trap this kind of error.
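
As a rough illustration of what error checking on a kernel launch can look like: a launch expression returns no status, so the runtime has to be queried afterwards. The sketch below uses a hypothetical kernel `touch` and a hypothetical helper `checked_launch`; only the `cuda*` runtime calls are real API. With 6144 threads in one block the launch should be rejected, and the error should surface from `cudaGetLastError()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only to exercise the launch configuration (hypothetical).
__global__ void touch(unsigned char* out, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) out[j] = 0;
}

// Launches `touch` with a single block of `threadsPerBlock` threads and
// returns the first error seen, checking both the launch and the execution.
cudaError_t checked_launch(int threadsPerBlock)
{
    unsigned char* dOut = 0;
    cudaError_t err = cudaMalloc((void**)&dOut, threadsPerBlock);
    if (err != cudaSuccess) return err;

    touch<<<1, threadsPerBlock>>>(dOut, threadsPerBlock);

    // The launch itself returns nothing, so ask the runtime explicitly.
    err = cudaGetLastError();            // launch errors, e.g. a bad block size
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();   // errors raised while the kernel ran

    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(dOut);
    return err;
}
```

In the question's code, assuming `gpuErrchk` wraps a `cudaError_t` as it appears to, the equivalent would be to add `gpuErrchk(cudaPeekAtLastError());` and `gpuErrchk(cudaDeviceSynchronize());` right after the `avg_arrays<<<dimGrid, dimBlock>>>(...)` line.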


answered Nov 12 '13 at 16:58

Robert Crovella


OK, this is fixed now. It was a combination of two errors: a) spotted by Jeffrey: instead of passing "i" I was passing 0, so I was processing the same first batch of pixels over and over again; b) spotted by Robert: decreasing N to 256 brought the expected result. Thanks a lot, you've saved my day! – Pete Kozak Nov 12 '13 at 17:02
