I am working on an N-body problem that requires a large amount of shared memory.
Basically, there are N independent tasks, each one using 4 double variables, i.e. 32 bytes, and each task is executed by one thread.
For speed, I keep these variables in shared memory (the threads are also using registers). Since the number N of tasks is not known at compile time, the shared memory is allocated dynamically.
The grid dimension and the shared memory size are computed from N and the block size:
const size_t BLOCK_SIZE = 512;
const size_t GRID_SIZE = (N % BLOCK_SIZE) ? (N / BLOCK_SIZE + 1) : (N / BLOCK_SIZE);
const size_t SHARED_MEM_SIZE = BLOCK_SIZE * 4 * sizeof(double);
Then the kernel is launched using these three variables:
kernel_function<<<GRID_SIZE, BLOCK_SIZE, SHARED_MEM_SIZE>>>(N, ...);
For small N, this works fine and the kernel executes without error.
But if N exceeds 1500, the kernel launch fails, and the following messages appear multiple times:
========= Invalid __global__ write of size 8
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaLaunch.
As far as I understand, this is caused by an attempt to write out of the bounds of the allocated shared memory. It occurs in the kernel, when global memory is copied into shared memory:
__global__ void kernel_function(const size_t N, double *pN, ...)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N)
    {
        extern __shared__ double pN_shared[];
        for (int i = 0; i < 4; i++)
        {
            pN_shared[4*idx + i] = pN[4*idx + i];
        }
        ...
    }
}
This error happens only if N > 1500, i.e. when the overall amount of shared memory exceeds 48 kB (1500 * 4 * sizeof(double) = 1500 * 32 = 48000 bytes).
This limit is the same regardless of the grid and the block size.
If I have understood correctly how CUDA works, the cumulative amount of shared memory used by the whole grid is not limited to 48 kB; that is only the limit for a single thread block.
So this error makes no sense to me: the cumulative amount of shared memory should only affect how the grid is scheduled across the streaming multiprocessors (and the GPU device has 15 SMs at its disposal).