
The goal is to calculate the average of multiple images. I already have it working the normal way on the CPU, so I thought I would give it a go using CUDA. Unfortunately, what I receive in the output is the first image instead of the average. (Inside the kernel I also tried setting some pixels to 0 to make sure something was happening, but no luck.)

////My kernel:
//nImages - number of images in memory
//nBytes - number of pixels*colors per image (also the size of dataOut)
//nImages*nBytes gives us the size of dataIn
//nBatch - dataIn has 1 million bytes per image, we run in 6144 threads, so we need 163 batches to calc the whole dataOut
__global__ 
void avg_arrays(unsigned char* cuDataIn, unsigned char* cuDataOut, int nImages, int nBytes, int nBatch) 
{
   //get the position of the correct byte
   int j = threadIdx.x +  nBatch;
   //if we're outside of image then give up
   if(j >= nBytes) return;
   //proceed averaging
   long lSum = 0;
   for(int i=0; i < nImages; ++i) 
      lSum += cuDataIn[i*nBytes + j];
   lSum = lSum / nImages;
   cuDataOut[j] = lSum;
}

Memory allocation etc.

unsigned char* dataIn = 0;
unsigned char* dataOut= 0;

// Allocate and transfer memory to the device
gpuErrchk( cudaMalloc((void**)&dataIn, nPixelCountBGR * nNumberOfImages * sizeof(unsigned char)));                                  //dataIn
gpuErrchk( cudaMalloc((void**)&dataOut, nPixelCountBGR * sizeof(unsigned char)));                               //dataOut
gpuErrchk( cudaMemcpy(dataIn, bmps,  nPixelCountBGR * nNumberOfImages * sizeof(unsigned char), cudaMemcpyHostToDevice ));           //dataIn
gpuErrchk( cudaMemcpy(dataOut, basePixels, nPixelCountBGR * sizeof(unsigned char), cudaMemcpyHostToDevice ));   //dataOut

// Perform the array addition
dim3 dimBlock(N);  
dim3 dimGrid(1);

//do it in batches, unless it's possible to run more threads at once, anyway N is a number of max threads
for(int i=0; i<nPixelCountBGR; i+=N){
   cout << "Running with: nImg: "<< nNumberOfImages << ", nPixBGR " << nPixelCountBGR << ", and i = " << i << endl;
   avg_arrays<<<dimGrid, dimBlock>>>(dataIn, dataOut, nNumberOfImages, nPixelCountBGR, 0);
}
// Copy the Contents from the GPU
gpuErrchk(cudaMemcpy(basePixels, dataOut, nPixelCountBGR * sizeof(unsigned char), cudaMemcpyDeviceToHost)); 

gpuErrchk(cudaFree(dataOut));
gpuErrchk(cudaFree(dataIn));

The error check doesn't report anything and all the code runs smoothly, but all I get at the end is an exact copy of the first image.

In case anyone needs it, here is some console output:

Running with: nImg: 29, nPixBGR 1228800, and i = 0
...
Running with: nImg: 29, nPixBGR 1228800, and i = 1210368
Running with: nImg: 29, nPixBGR 1228800, and i = 1216512
Running with: nImg: 29, nPixBGR 1228800, and i = 1222656
Time of averaging: 0.219
Pete Kozak
  • The basic problem is that you are averaging the `i=0` part over and over again. – Jeffrey Sax Nov 12 '13 at 16:46
  • Jeffrey, can you say something more? Each time I start averaging a different batch of 6144 pixels and go through all the images stored in cuDataIn (the first image starts at 0, the second at 1228800, and so on). Inside the kernel I attempt to go through all of the images I currently keep in memory (29)? – Pete Kozak Nov 12 '13 at 16:50
  • Where in your loop do you actually use `i`? – Jeffrey Sax Nov 12 '13 at 16:50
  • What is `N` ? You're not doing error checking correctly on the kernel call. If N is greater than 512 or 1024, your kernel is not running. – Robert Crovella Nov 12 '13 at 16:52
  • Jeffrey, I've spotted that silly 0 at the end ("..., nPixelCountBGR, 0);") and updated it to "i", but still no luck :/ Robert, N was set to 6144 in the following way: #define N 6144. I've tried changing it to a number lower than 512 and it still has no effect; I receive the first image on the output. – Pete Kozak Nov 12 '13 at 16:56
  • @PeteKozak You may want to learn more about blocks and grids. You're cutting up your data into chunks manually here, and calling a kernel for each of your chunks, but you can have it all done automatically in one kernel launch. Udacity's course is great, and free: https://www.udacity.com/course/cs344. – Jeffrey Sax Nov 12 '13 at 19:24

1 Answer


If N is greater than 512 or 1024 (depending on which GPU you are running on, which you don't mention), then this is invalid:

dim3 dimBlock(N); 

because you can't launch a kernel with more than 512 or 1024 threads per block:

 avg_arrays<<<dimGrid, dimBlock>>>(...
                          ^
                          |
                     this is limited to 512 or 1024

If you study proper CUDA error checking and apply it to your kernel launch, you'll trap this kind of error.
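As a rough illustration of what checking a kernel launch can look like, here is a minimal sketch. The question never shows how gpuErrchk is defined, so the macro and the gpuAssert helper below are assumptions written in the commonly used style, and the touch kernel is a hypothetical stand-in for avg_arrays. The point is that a launch returns no status itself, so cudaGetLastError() has to be queried right after it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed definition: the question uses gpuErrchk but does not show it.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char* file, int line)
{
   if (code != cudaSuccess) {
      fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      exit(EXIT_FAILURE);
   }
}

// Hypothetical stand-in kernel, only here to have something to launch.
__global__ void touch(unsigned char* out)
{
   out[threadIdx.x] = 1;
}

int main()
{
   unsigned char* d = 0;
   gpuErrchk(cudaMalloc((void**)&d, 6144));

   // 6144 threads per block exceeds the 512/1024 limit, so this launch is rejected.
   touch<<<1, 6144>>>(d);

   // A kernel launch returns nothing, so ask the runtime explicitly.
   gpuErrchk(cudaGetLastError());        // catches launch configuration errors
   gpuErrchk(cudaDeviceSynchronize());   // catches errors raised while the kernel runs

   gpuErrchk(cudaFree(d));
   return 0;
}
```

With a block size of 6144 the cudaGetLastError() check should stop the program with a launch configuration error, whereas without that check the failed launch goes unnoticed and dataOut keeps whatever was copied into it beforehand.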

Robert Crovella
  • OK this is fixed now. That was a combination of two errors: a) spotted by Jeffrey, instead of passing "i" I was passing 0 and was looping over and over again through the same image b) spotted by Robert, the decreased value of N to 256 brought the expected result. Thanks a lot, you've saved my day! – Pete Kozak Nov 12 '13 at 17:02
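
For reference, the two fixes can be combined with Jeffrey's suggestion to let the grid cover the whole image in one launch instead of looping over batches on the host. The following is only a sketch under assumptions: the check helper is hypothetical, the input is synthetic (every byte of image i holds the value i), and the kernel drops the nBatch parameter in favour of blockIdx.x, which is not how the original code was written.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call failed.
static void check(cudaError_t code, const char* what)
{
   if (code != cudaSuccess) {
      fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(code));
      exit(EXIT_FAILURE);
   }
}

// One thread per output byte; the grid as a whole covers the image.
__global__ void avg_arrays(const unsigned char* cuDataIn, unsigned char* cuDataOut,
                           int nImages, int nBytes)
{
   int j = blockIdx.x * blockDim.x + threadIdx.x;
   if (j >= nBytes) return;                 // the last block may overshoot
   int sum = 0;
   for (int i = 0; i < nImages; ++i)
      sum += cuDataIn[(size_t)i * nBytes + j];
   cuDataOut[j] = (unsigned char)(sum / nImages);
}

int main()
{
   const int nImages = 29, nBytes = 1228800;   // sizes taken from the console output
   const int threads = 256;                    // block size reported to work
   const int blocks = (nBytes + threads - 1) / threads;

   // Synthetic data (assumption): image i is filled with the value i,
   // so every averaged byte should be (0 + 1 + ... + 28) / 29 = 14.
   std::vector<unsigned char> in((size_t)nImages * nBytes), out(nBytes);
   for (int i = 0; i < nImages; ++i)
      for (int j = 0; j < nBytes; ++j)
         in[(size_t)i * nBytes + j] = (unsigned char)i;

   unsigned char *dIn = 0, *dOut = 0;
   check(cudaMalloc((void**)&dIn, in.size()), "cudaMalloc dIn");
   check(cudaMalloc((void**)&dOut, out.size()), "cudaMalloc dOut");
   check(cudaMemcpy(dIn, in.data(), in.size(), cudaMemcpyHostToDevice), "copy in");

   avg_arrays<<<blocks, threads>>>(dIn, dOut, nImages, nBytes);
   check(cudaGetLastError(), "kernel launch");
   check(cudaMemcpy(out.data(), dOut, out.size(), cudaMemcpyDeviceToHost), "copy out");

   int bad = 0;
   for (int j = 0; j < nBytes; ++j)
      if (out[j] != 14) ++bad;
   printf("mismatching bytes: %d\n", bad);

   check(cudaFree(dOut), "cudaFree dOut");
   check(cudaFree(dIn), "cudaFree dIn");
   return bad != 0;
}
```

A single launch with 4800 blocks of 256 threads replaces the host-side loop, and consecutive threads read consecutive bytes of each image, which keeps the global memory accesses contiguous.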