I have a loop that I am trying to parallelize in CUDA. It goes something like this:
float *buf = new float[buf_size]; // buf_size <= 100
for (int j = 0; j < N; j++) {
    calculate_with(buf);
}
delete [] buf;
The nature of the loop is that the contents of the buffer at the start of each iteration do not matter, so the loop can be parallelized quite trivially.
But in CUDA I now need a much larger buffer, because the kernel runs all iterations concurrently and each thread needs its own segment of it:
__global__ void loop_kernel(float *buf_gpu) {
    const int idx = index_gpu(blockIdx, blockDim, threadIdx);
    float *buf = buf_gpu + idx * buf_size; // this thread's private segment
    calculate_with(buf);
}
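Here index_gpu just flattens the built-in variables into a global thread index; something along these lines (1-D case shown):
__device__ int index_gpu(uint3 block_idx, dim3 block_dim, uint3 thread_idx) {
    return block_idx.x * block_dim.x + thread_idx.x;
}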
...
float *buf_gpu;
cudaMalloc(&buf_gpu, sizeof(float) * N * buf_size);
loop_kernel<<<mesh, block>>>(buf_gpu);
cudaFree(buf_gpu);
Since each thread of the kernel gets its own segment of the buffer, the total allocation now scales with the loop size N, which is obviously problematic: instead of (buffer size) of memory, I have to allocate (buffer size * loop size). The memory limit of my GTX 590 is hit for fairly typical values of N in the problem I am working on.
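One direction I have been considering (just a sketch, not tested; it assumes mesh and block are plain ints for a 1-D launch, with buf_size and calculate_with as above) is a grid-stride loop, so the buffer scales with the number of launched threads rather than with N:
__global__ void loop_kernel_strided(float *buf_gpu, int N) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const int stride = gridDim.x * blockDim.x;
    float *buf = buf_gpu + idx * buf_size; // one segment per thread, reused across iterations
    for (int j = idx; j < N; j += stride) {
        calculate_with(buf); // leftover contents from the previous iteration don't matter
    }
}
...
// allocation now scales with the launch size, not with N
cudaMalloc(&buf_gpu, sizeof(float) * mesh * block * buf_size);
loop_kernel_strided<<<mesh, block>>>(buf_gpu, N);
But I am not sure this is the idiomatic way to handle per-thread scratch space.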
EDIT: to elaborate on my other attempt. Since buf_size is not too big, I also tried rewriting the kernel to allocate the buffer on the device heap:
__global__ void loop_kernel() {
    float *buf = new float[buf_size]; // per-thread allocation on the device heap
    calculate_with(buf);
    delete [] buf;
}
...
assert(cudaSuccess == cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024));
loop_kernel<<<mesh, block>>>();
assert(cudaSuccess == cudaDeviceSynchronize());
The cudaDeviceSynchronize() assertion fails with return status 4, which as far as I can tell corresponds to cudaErrorLaunchFailure ("unspecified launch failure"). I have no idea what is causing it.
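For what it's worth, this is how I turned the raw status into a readable message (cudaGetErrorString is the standard runtime call; the check_cuda wrapper is just my own convenience helper):
#include <cstdio>
#include <cuda_runtime.h>

// convenience helper (my own, not a CUDA API): print a readable message for a runtime status
static void check_cuda(cudaError_t status, const char *what) {
    if (status != cudaSuccess) {
        fprintf(stderr, "%s: %s (error %d)\n",
                what, cudaGetErrorString(status), (int)status);
    }
}
...
check_cuda(cudaDeviceSynchronize(), "loop_kernel");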