
I am trying to implement a box filter in C/CUDA, starting with a matrix-average problem first. When I run the following code with the lines inside the for loops uncommented, I get a certain output. But when I comment those lines out, it generates the same output again!

    if(tx==0)
        for(int i=1;i<=radius;i++)
        {
            //sharedTile[radius+ty][radius-i] = 6666.0;
        }

    if(tx==(Dx-1))
        for(int i=0;i<radius;i++)
        {
            //sharedTile[radius+ty][radius+Dx+i] = 7777;
        }

    if(ty==0)
        for(int i=1;i<=radius;i++)
        {
            //sharedTile[radius-i][radius+tx]= 8888;
        }

    if(ty==(Dy-1))
        for(int i=0;i<radius;i++)
        {
            //sharedTile[radius+Dy+i][radius+tx] = 9999;
        }

    if((tx==0)&&(ty==0))
        for(int i=globalRow,l=0;i<HostPaddedRow,l<radius;i++,l++)
        {
            for(int j=globalCol,m=0;j<HostPaddedCol,m<radius;j++,m++)
            {
                //sharedTile[l][m]=8866;
            }
        }

    if((tx==(Dx-1))&&(ty==(Dx-1)))
        for(int i=(HostPaddedRow+1),l=(radius+Dx);i<(HostPaddedRow+1+radius),l<(TILE+2*radius);i++,l++)
        {
            for(int j=HostPaddedCol,m=(radius+Dx);j<(HostPaddedCol+radius),m<(TILE+2*radius);j++,m++)
            {
                //sharedTile[l][m]=7799.0;
            }
        }

    if((tx==(Dx-1))&&(ty==0))
        for(int i=(globalRow),l=0;i<HostPaddedRow,l<radius;i++,l++)
        {
            for(int j=(HostPaddedCol+1),m=(radius+Dx);j<(HostPaddedCol+1+radius),m<(TILE+2*radius);j++,m++)
            {
                //sharedTile[l][m]=9966;
            }
        }

    if((tx==0)&&(ty==(Dy-1)))
        for(int i=(HostPaddedRow+1),l=(radius+Dy);i<(HostPaddedRow+1+radius),l<(TILE+2*radius);i++,l++)
        {
            for(int j=globalCol,m=0;j<HostPaddedCol,m<radius;j++,m++)
            {
                //sharedTile[l][m]=0.0;
            }
        }
    __syncthreads();

You can ignore the for-loop conditions and so on; they are irrelevant here right now. My basic question is: why am I getting the same values even after commenting those lines out? I tried making some modifications in my main program and kernel as well. I also introduced deliberate errors, removed them again, recompiled and re-executed the same code, but I still get the same values. Is there any way to clear cache memory in CUDA? I am using Nsight + RedHat + CUDA 5.5. Thanks in advance.


1 Answer


why am I getting the same values even after commenting those lines out?

It seems that sharedTile points to the same piece of memory across multiple consecutive runs, which is absolutely normal. The commented-out code therefore does not "generate" anything; your pointer is simply referring to the same memory, whose contents were never flushed.
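To illustrate the point, here is a minimal, hypothetical kernel (made-up names, not code from your program) that only reads its shared memory without ever writing to it. The contents of the tile are undefined; in practice they are frequently whatever an earlier launch left behind, which is why stale values can keep reappearing:

__global__ void read_stale_tile(float* out)
{
    // launch with at most 256 threads per block
    __shared__ float tile[256];
    // tile[] is never written here, so its contents are undefined;
    // often they are simply leftovers from a previous kernel.
    out[threadIdx.x] = tile[threadIdx.x];
}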

Is there any way to clear cache memory in CUDA

I believe you are talking about clearing shared memory? If so, you can use an approach analogous to the one described here. Instead of using cudaMemset in host code, you'll be zeroing out your shared memory from inside the kernel. The simplest approach is to place the following code at the beginning of the kernel that declares sharedTile (this version is for one-dimensional thread blocks and a one-dimensional shared memory array):

__global__ void your_kernel(int count) {
    // Dynamically allocated shared memory; its size is supplied at launch
    extern __shared__ float sharedTile[];
    // Each thread zeroes a strided subset of the array
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        sharedTile[i] = 0.0f;
    __syncthreads();
    // your code here
}
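Since your sharedTile is a two-dimensional, statically sized tile indexed with both tx and ty, the same idea adapted to that case might look roughly like this sketch (TILE and RADIUS are assumed compile-time constants standing in for your TILE and radius; the kernel signature is only a placeholder):

#define TILE 16     // assumed tile width, matching blockDim.x and blockDim.y
#define RADIUS 2    // assumed filter radius

__global__ void box_filter_kernel(const float* in, float* out, int width, int height)
{
    __shared__ float sharedTile[TILE + 2*RADIUS][TILE + 2*RADIUS];

    // Every thread clears a strided subset of the tile before any halo loads,
    // so stale shared-memory contents cannot leak into the output.
    for (int i = threadIdx.y; i < TILE + 2*RADIUS; i += blockDim.y)
        for (int j = threadIdx.x; j < TILE + 2*RADIUS; j += blockDim.x)
            sharedTile[i][j] = 0.0f;
    __syncthreads();

    // ... your halo loads and box-filter accumulation go here ...
}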

The following approaches do not guarantee that shared memory is cleared, as Robert Crovella pointed out in the comments below:

  • Possibly calling nvidia-smi with the --gpu-reset parameter.
  • Another solution offered in the other SO thread, which involves unloading and reloading the driver.
  • The linked approach ("described [here](http://stackoverflow.com/a/9518858/3242721)", i.e. `cudaMemset`) cannot be used with shared memory. Neither `nvidia-smi --gpu-reset` nor unloading/reloading the driver is guaranteed to have any effect on the contents of shared memory. All 3 of your suggested approaches to clearing shared memory are suspect. The only reliable way to modify shared memory is to write kernel code that modifies shared memory. – Robert Crovella Aug 23 '14 at 16:25
  • By pointing to the "described here" approach I did not have usage of `cudaMemset` in mind. It was more of a high-level resemblance to clearing shared memory by setting the bytes to zero explicitly, even though this time it would be done from inside the kernel instead. That's what I had in mind by saying "you'll be zeroing out your shared memory from inside the kernel". I'll update my answer with your findings, thanks. – Michal Hosala Aug 23 '14 at 17:57