
I have a loop that I am trying to parallelize in CUDA. It goes something like this:

float *buf = new float[buf_size]; // buf_size <= 100
for (int j = 0; j < N; j++){
    calculate_with(buf);
}
delete [] buf;

The nature of the loop is that the values in the buffer array at the start of each iteration do not matter, so the loop itself can be parallelized quite trivially.

But in CUDA, I now need a much larger buffer, because the iterations now run concurrently as kernel threads and each thread needs its own segment of the buffer.

__global__ void loop_kernel(float *buf_gpu) {
    const int idx = index_gpu(blockIdx, blockDim, threadIdx);
    float *buf = buf_gpu + (idx * buf_size);
    calculate_with(buf);
}
    ....
    float *buf_gpu;
    cudaMalloc(&buf_gpu, sizeof(float)*N*buf_size);
    loop_kernel<<<mesh,block>>>(buf_gpu);
    cudaFree(buf_gpu);
}

Since each thread launched by the kernel gets its own segment of the buffer, the buffer size now scales with the loop size N, which is obviously problematic. Instead of using (buffer size) amount of memory, I now have to allocate (buffer size * loop size). The GPU memory limit of my GTX 590 is hit for fairly typical values of N in the problem I am working on.
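
To make the scaling concrete, a rough check of the required allocation against the free device memory (a sketch only; cudaMemGetInfo is just one way to query it) looks like this:

// rough sanity check: per-thread scratch space scales linearly with N
size_t needed = sizeof(float) * (size_t)N * buf_size;
size_t free_mem = 0, total_mem = 0;
cudaMemGetInfo(&free_mem, &total_mem);
if (needed > free_mem) {
    // the scratch buffers alone no longer fit in GPU memory
    fprintf(stderr, "need %zu bytes, only %zu free\n", needed, free_mem);
}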

EDIT: to elaborate on my other attempt. Since buf_size is not too big, I also tried rewriting the kernel like this:

__global__ void loop_kernel() {
    float *buf = new float[buf_size];
    calculate_with(buf);
    delete [] buf;
}
...
assert(cudaSuccess == cudaDeviceSetLimit(cudaLimitMallocHeapSize,8*1024*1024));
loop_kernel<<<mesh,block>>>();
assert(cudaSuccess == cudaDeviceSynchronize());

The cudaDeviceSynchronize() assertion fails with return status 4. No idea what that means.

  • You could show what is done in `calculate_with()`. You should probably look for other ways to parallelize your code. Another thing: using global memory as a buffer is usually not a good idea, since global memory is slow compared to the GPU cores. – kangshiyin Aug 30 '13 at 02:34
  • It looks like you're using sizeof(float) * buf_size, i.e., 4 * 100 bytes (approx., assuming sizeof(float) = 4) of memory in the CPU case, whereas on the GPU you're using sizeof(float) * N * buf_size, i.e., (4 * 100 * 100). – Sagar Masuti Aug 30 '13 at 05:24
  • @Eric, It is complicated. For the sake of argument, let's say it does the following assignment: buf[x] = bessel_function(j, x); and then uses the values (read-only) in the buf array to calculate some other things. I do appreciate your point about using global memory for the buffer. This is my first attempt at CUDA-izing the code, so it still looks somewhat like the original code. As mentioned, I also tried, in the loop_kernel function: `float *buf = new float[buf_size]; ...; delete[] buf;` but cudaDeviceSynchronize() after the loop_kernel call keeps failing with status code 4. I don't know what that means. – lasagne.victim Aug 30 '13 at 07:56
  • Error 4 is `cudaErrorLaunchFailure`, i.e. your kernel never launched. And that is probably because you have set the runtime heap to be so small. – talonmies Aug 30 '13 at 08:19
  • "No idea what that means." You can find out if you do [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api), as sketched below. – Robert Crovella Aug 30 '13 at 11:40

1 Answer


You haven't told us anything about calculate_with(), so it's not clear whether any of that is parallelizable, but that is certainly something worth investigating.

One approach, however, is simply to limit your buffer size to what can be handled by GPU memory, and then call the kernel in a loop based on that buffer size:

__global__ void loop1_kernel(float *buf_gpu) {
  const int idx = index_gpu(blockIdx, blockDim, threadIdx);
  float *buf = buf_gpu + (idx * buf_size);
  calculate_with(buf);
}
....
float *buf_gpu;
cudaMalloc(&buf_gpu, sizeof(float)*num_buffs*buf_size);
for (int j = 0; j < N/num_buffs; j++){
  loop1_kernel<<<mesh,block>>>(buf_gpu);
  cudaMemcpy(host_data, buf_gpu, sizeof(float)*num_buffs*buf_size, cudaMemcpyDeviceToHost);
}
cudaFree(buf_gpu);
}

Obviously, the cudaMemcpy line only needs to copy whatever data the kernel actually produces that needs to be saved.
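
For example, if the results of all N iterations need to end up on the host, each chunk could land at its own offset (a sketch only; it assumes host_data holds N*buf_size floats and that N is a multiple of num_buffs):

for (int j = 0; j < N/num_buffs; j++){
  loop1_kernel<<<mesh,block>>>(buf_gpu);
  cudaMemcpy(host_data + (size_t)j*num_buffs*buf_size,   // per-chunk offset into the host array
             buf_gpu,
             sizeof(float)*num_buffs*buf_size,
             cudaMemcpyDeviceToHost);
}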

Robert Crovella