
I am working on HD image processing using CUDA 7.5 with an NVIDIA GeForce 840M on Ubuntu 14.04. I have a 3750*3750 image, and I am having trouble initializing an array of this size. The following code only runs until k reaches about 4000.

__device__ int sImg;

__device__ int *B;

/* ############################### INITIALIZE ############################## */

__global__ void initialize(int *tab, int v, int s)
{    
  int k = blockDim.x*blockIdx.x + threadIdx.x ;
  if ( k < s )
    tab[k] = v;
}

/* ########################### The parent kernel ########################### */

__global__ void EDGE(int *A, int *C ,int h, int w, int dim, int nbScales)
{
  sImg = dim*dim;
  cudaMalloc((void**)&B,sImg*sizeof(int));

  int threadsPerBlock = 256;
  int blocksPerGrid = (sImg + threadsPerBlock -1) / threadsPerBlock;

  /// I have troubles here, it does not complete the process
  initialize<<<blocksPerGrid,threadsPerBlock>>>(B,0,sImg);
  cudaDeviceSynchronize();
  initialize<<<blocksPerGrid,threadsPerBlock>>>(C,0,sImg);
  cudaDeviceSynchronize();  

  /// A transformation into the frequency domain
  FSDWT <<< 1 , nbScales >>> (A,B, h, w,dim,nbScales);
  cudaDeviceSynchronize();

  /// Thresholding the transform
  Treshold<<<1,1>>>(B,C,dim*dim);
  cudaDeviceSynchronize();

  cudaFree(B);
}

/* ############################  call from host ############################ */

extern "C" void EDGE_host(int *A,int *B,int h,int w,int dim, int nbScales)
{
  EDGE <<< 1 , 1 >>> (A,B, h, w,dim,nbScales);
}

Thank you very much

assma
    Your `parentKernel` looks like it contains code much more suitable to run on the host - I can't think of a reason to run it as a kernel other than an exercise in using dynamic parallelism. Is that intentional? – tera Jan 26 '17 at 12:38
  • Also, check all CUDA calls for errors returned. That will likely show why my answer below is important. – tera Jan 26 '17 at 15:52
  • I cannot post all of my code here because I have many child kernels and operations. I just want to fix the initialize child kernel before moving on to other, more complex kernels. CUDA does not return any error – assma Jan 26 '17 at 15:55
  • "CUDA does not return any error" - how do you know if you don't [check](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api)? – tera Jan 26 '17 at 16:12
  • I used these checks: `if (cudaSuccess != cudaGetLastError()) { return; }` and `if (cudaSuccess != cudaDeviceSynchronize()) { return; }` – assma Jan 26 '17 at 16:59
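
For reference, the canonical pattern from the linked answer looks roughly like this. This is a sketch, not the asker's code, and the `CUDA_CHECK` macro name is my own choice, not an official API:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print the error string and abort on failure.
#define CUDA_CHECK(call)                                                  \
  do {                                                                    \
    cudaError_t err = (call);                                             \
    if (err != cudaSuccess) {                                             \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                         \
              cudaGetErrorString(err), __FILE__, __LINE__);               \
      exit(EXIT_FAILURE);                                                 \
    }                                                                     \
  } while (0)

// Usage after a kernel launch:
//   EDGE<<<1, 1>>>(A, B, h, w, dim, nbScales);
//   CUDA_CHECK(cudaGetLastError());        // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised inside the kernel
```

Silently returning on error, as in the comment above, hides the error message; printing `cudaGetErrorString()` would have shown the out-of-memory failure directly.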

3 Answers


OK, a couple of things:

1) Use `cudaMalloc` instead of `malloc`.

2) In `cudaMalloc`, use `sImg` instead of `dim*dim` (I assume they are the same).

pSoLT

Memory allocations made from device code are satisfied from a pool of limited size. Either set aside more memory for device-side allocations before any is made, via a call to `cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)`, or allocate the memory from the host side using `cudaMalloc()`.

When choosing how much memory to set aside, be aware that a call to malloc() on the device makes a separate allocation for each thread, so requirements increase quickly with the number of threads running in parallel. If, as in your case, the kernel doesn't free the memory, the memory needed grows with the total number of threads run, rather than with the number of threads running in parallel.
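
The host-side alternative could look roughly like this. This is my adaptation, not the asker's code: `B` is allocated once with `cudaMalloc()` on the host and passed into the parent kernel, which assumes `EDGE` is changed to take `B` as a parameter instead of using the global `__device__` pointer:

```cuda
// Sketch: allocate B from the host and pass it down, so no device-side
// malloc heap is needed at all.
extern "C" void EDGE_host(int *A, int *C, int h, int w, int dim, int nbScales)
{
  int *B = NULL;
  cudaMalloc((void**)&B, dim * dim * sizeof(int));   // host-side allocation

  // Assumes EDGE's signature is extended to accept B as an argument.
  EDGE<<<1, 1>>>(A, B, C, h, w, dim, nbScales);
  cudaDeviceSynchronize();

  cudaFree(B);   // freed from the host as well
}
```

A single host-side allocation is also reused across the whole grid, whereas device-side `malloc()` gives each calling thread its own block.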

tera

Thank you very much, dear all. My problem is solved with the answer from Mr tera: the problem was indeed with memory allocation. I added the following line in the main function, before calling any kernel.

cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
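
Note that the limit must be set before the first kernel launch; the current value can be read back with `cudaDeviceGetLimit()` to confirm it took effect. A small sketch (my addition, not part of the original answer):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
  // Reserve 128 MiB for device-side malloc, then verify the setting.
  size_t heapSize = 0;
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);
  cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
  printf("Device malloc heap size: %zu bytes\n", heapSize);
  return 0;
}
```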

assma