-1

I'm programming a Kernel for a 3x3 median filter and want to apply it to images. My images are stored like float *myImage = new float[pixelCount * channelCount] in RGB.

I launch a thread for each pixel and calculate all 3 colors in each thread.

I tried it on different image sizes with different results:

  • 512x512: No cuda errors, erros when running with cuda-memcheck
  • 1024x1024 and higher: cuda errors and memcheck-erros

The cuda error I get:

an illegal memory access was encountered 

The (first) cuda-memcheck output:

========= CUDA-MEMCHECK  
========= Invalid __global__ read of size 4  
=========     at 0x00001410 in   BackwardMappingCUDAUtils::parallelMedianInImage(float*, float*, unsigned int, unsigned int, int)  
=========     by thread (257,0,0) in block (127,0,0)  
=========     Address 0x7f535e5c0000 is out of bounds  
=========     Saved host backtrace up to driver entry point at kernel launch time  
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1   (cuLaunchKernel + 0x2cd) [0x22b3fd]  
=========     Host Frame:/path/to/libcudart.so.9.1 [0x15f70]  
=========     Host Frame:/path/to/libcudart.so.9.1 (cudaLaunch + 0x14e) [0x347be]  
=========     Host Frame:/path/to/build_debug  /lib/libBackwardMappingCudaUtilsD.so [0x23fc]  
=========     Host Frame:/path/to/build_debug  /lib/libBackwardMappingCudaUtilsD.so  (_Z75__device_stub__ZN24BackwardMappingCUDAUtils21parallelMedianInImageEPfS0_jjiPfS_jji + 0xd6) [0x20f2]  
=========     Host Frame:/path/to/build_debug/lib/libBackwardMappingCudaUtilsD.so (_ZN24BackwardMappingCUDAUtils21parallelMedianInImageEPfS0_jji + 0x36) [0x2139]  
=========     Host Frame:./CUDAStream (main + 0x1476) [0xf211]  
=========     Host Frame:/path/to/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]  
=========     Host Frame:./CUDAStream (_start + 0x2a) [0xd63a]  

The Kernel:

__global__
void parallelMedianInImage(float *source, float *sink, unsigned int width, unsigned int pixelCount, int channelCount)
{
  unsigned int pixelID = blockIdx.x * blockDim.x + threadIdx.x; 
  unsigned int colorIndexRed = pixelID * channelCount;
  unsigned int colorWidth = width * channelCount;
  unsigned int valueCount = pixelCount * channelCount;

  if(pixelID<pixelCount)
  {
    int validValues = 0;
    bool valid[9];
    int indizes[9];

    indizes[0] = colorIndexRed - colorWidth - channelCount;
    indizes[1] = colorIndexRed - colorWidth;
    indizes[2] = colorIndexRed - colorWidth + channelCount;
    indizes[3] = colorIndexRed - channelCount;
    indizes[4] = colorIndexRed;
    indizes[5] = colorIndexRed + channelCount;
    indizes[6] = colorIndexRed + colorWidth - channelCount;
    indizes[7] = colorIndexRed + colorWidth;
    indizes[8] = colorIndexRed + colorWidth + channelCount;

    for(int u=0;u<9;u++)
    {
      valid[u] = true;

      if(u/3==0&&((indizes[u] / colorWidth) != (colorIndexRed / colorWidth) - 1)) valid[u] = false;
      if(u/3==1&&((indizes[u] / colorWidth) != (colorIndexRed / colorWidth))) valid[u] = false;
      if(u/3==2&&((indizes[u] / colorWidth) != (colorIndexRed / colorWidth) + 1)) valid[u] = false;
      if(indizes[u]<0 || indizes[u]>valueCount) valid[u] = false;

      if(valid[u]) validValues++;
    }

    for(int channel=0;channel<channelCount;channel++)
    {
      float values[9];
      for(int u=0;u<9;u++)
      {
        if(valid[u])
          values[u]=source[indizes[u] + channel];
        else
          values[u]=0.0;

      }
      insertionSortFloatArray(values, 9);
      int middleIndex = 8 - (validValues/2);
      sink[colorIndexRed + channel] = values[middleIndex];
    }
  }
}

The valid variables are for checking if all values are inside the image boundaries.
Sorting function:

__device__
void insertionSortFloatArray(float array[], int length)
{
  float swapper;
  for(int i=1;i<length;i++)
  {
    swapper = array[i];
    for(int u=i-1;u>=0;u--)
    {
      if(array[u]>swapper)
      {
        array[u+1] = array[u];
        array[u] = swapper;
      }
    }
  }
}

The kernel call, running in a loop because I have several images:

cudaMalloc((void**)&smallUndistortedDeviceImages[reducedIndex], sizeSmall);  
parallelMedianShrinking<<<(pixelCountSmall+TPB-1)/TPB,TPB>>>(undistortedDeviceImages[reducedIndex], smallUndistortedDeviceImages[reducedIndex], widthSmall, pixelCountSmall, channelCount);
error = cudaGetLastError();
if(error != cudaSuccess)
{
  printf(" ### CUDA error: %s\n", cudaGetErrorString(error));
}
//removed code that copies the result to the devices and stores it as an image
cudaFree(smallLightMaskStep1Images[reducedIndex]);

I tried this with different Thread per Block variables.
As the kernel is in a library that is linked dynamically, cuda-memcheck doesn't tell the exact line of the segfault. As it always happen in thread (1,0,0), I wrote a printf with if(threadIdx.x==1) that gave me all the variable addresses I could think of to compare it with the one from the cuda-memcheck output afterwars, but I could never found which variable it was. Due to outcommenting specific lines, I could trace it to somewhere near the last 3 lines of the kernel. The sorting function works and is used in another kernel. But when i comment the sorting call out and just use values[4], then it works (as copying the image without a filter).

Can't wrap my head around this. Thank you in advance. This is my first post, I hope I included everything relevant, sorry if I missed something.

Specs: Ubuntu 18.04, Cuda V9.1.85, Geforce GTX1080 with 8GB RAM

Turtle10000
  • 213
  • 1
  • 13
  • 2
    For questions like this, you are supposed to provide a [mcve]. See item 1 [here](https://stackoverflow.com/help/on-topic). Complete means someone else could copy what you have here, compile, and run, and see the issue, without having to add or change anything. That's not possible because your code isn't complete. It doesn't have to be *your whole code*. (It should be minimal, also.) Also, for basic debugging of issues like this with `cuda-memcheck`,try using [this method](https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218). – Robert Crovella Jul 26 '18 at 19:23
  • You are almost certainly passing a host pointer to your kernel. But you haven't poased an [MCVE] so there is no way to say for sure. I've voted to close this – talonmies Jul 26 '18 at 21:21
  • Does the sort run on the cpu or gpu? – Surt Jul 26 '18 at 23:40
  • Thank you all, I found the mistake and posted it as an answer. Sorry for not posting a minimal, complete and verifiable example, this was quite impossible as I use some not-longer-public libs to load the image, but I see that it was naive to ask anyway. – Turtle10000 Jul 27 '18 at 15:42
  • 1
    The image loading process is irrelevant for the problem at hand. You misunderstand what is being asked for in the [mcve]. It certainly is not/was not impossible. – Robert Crovella Jul 27 '18 at 15:58

1 Answers1

1

The error was in if(indizes[u]<0 || indizes[u]>valueCount) valid[u] = false; where it should say indizes[u]>=valueCount as of course we are counting from zero. This caused the segfault. It works fine now.

Turtle10000
  • 213
  • 1
  • 13