I am working on an N-body problem that requires a large amount of shared memory.
Basically, there are N independent tasks, each one using 4 double variables, i.e. 32 bytes, and each task is executed by one thread.
For speed, I keep these variables in shared memory (the threads are also using registers). Since the number N of tasks is not known at compile time, the shared memory is allocated dynamically.
The grid dimension and the shared memory size are computed from N and the block size:
const size_t BLOCK_SIZE = 512;
const size_t GRID_SIZE = (N % BLOCK_SIZE) ? (N / BLOCK_SIZE + 1) : (N / BLOCK_SIZE);
const size_t SHARED_MEM_SIZE = BLOCK_SIZE * 4 * sizeof(double);
Then the kernel is launched using these three variables:
kernel_function<<<GRID_SIZE, BLOCK_SIZE, SHARED_MEM_SIZE>>>(N, ...);
For small N, this works fine and the kernel executes without error.
But if N exceeds 1500, the kernel launch fails, and the following messages appear multiple times:
========= Invalid __global__ write of size 8
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaLaunch.
As far as I understand, this is caused by an attempt to write out of the bounds of the allocated shared memory. It occurs in the kernel, when global memory is copied into shared memory:
__global__ void kernel_function(const size_t N, double *pN, ...)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N)
    {
        extern __shared__ double pN_shared[];
        for (int i = 0; i < 4; i++)
        {
            pN_shared[4*idx + i] = pN[4*idx + i];
        }
        ...
    }
}
This error happens only if N > 1500, i.e. when the overall amount of shared memory exceeds 48 kB (1500 * 4 * sizeof(double) = 1500 * 32 = 48000 bytes).
This limit is the same regardless of the grid and the block size.
If I have understood correctly how CUDA works, the cumulative amount of shared memory used by the whole grid is not limited to 48 kB; that is only the limit for a single thread block.
So this error makes no sense to me: the cumulative amount of shared memory should only affect how the grid is scheduled across the streaming multiprocessors (and the GPU device has 15 SMs at its disposal).