
Is it possible to allocate shared memory for a kernel (declared inside the kernel or extern) and use it in other device functions called from the kernel? Especially interesting for me is if/how I can use it as a returned parameter/array.

It seems to be no problem to use shared memory as an input parameter to device functions (at least I get no problems, errors or unexpected results).

When I use it as a return parameter, I get several problems:

  • I can run the program when it is built with the debug configuration.

  • But I can't debug it -> it crashes in the device function when I use the shared memory.

  • I also get errors with cuda-memcheck -> "invalid __global__ read", because the address is out of bounds and it reads from the shared address space.

So is it possible to use shared memory for returning arrays from device functions to kernels?
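
For clarity, a minimal sketch of the pattern I mean (names are illustrative, not from my real code) - the device function writes its result into a shared array owned by the kernel, so shared memory acts as the "return" parameter:

__device__ void Compute( const int *aInput, int *aOutput )
{
    // aOutput is expected to point into the caller's shared array
    for( int i = 0; i < 10; i++ )
        aOutput[i] = aInput[i] * aInput[i];
}

__global__ void KernelSketch( const int *aGlobalIn )
{
    __shared__ int aShared[10];

    Compute( aGlobalIn, aShared ); // shared memory used as the output/return parameter

    // ... aShared is then consumed further inside the kernel
}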

EDIT:

I wrote a very simple example to rule out other errors on my part.

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK_RETURN(value) {                                      \
    cudaError_t _m_cudaStat = (value);                                  \
    if (_m_cudaStat != cudaSuccess) {                                   \
        printf( "Error %s at line %d in file %s\n",                     \
                cudaGetErrorString(_m_cudaStat), __LINE__, __FILE__);   \
        exit(-1);                                                       \
    } }

__device__ void Function( const int *aInput, volatile int *aOutput )
{
    for( int i = 0; i < 10; i++ )
        aOutput[i] = aInput[i] * aInput[i];
}

__global__ void Kernel( int *aInOut )
{
    __shared__ int aShared[10];

    for(int i=0; i<10; i++)
        aShared[i] = i+1;

    Function( aShared, aInOut );
}

int main( int argc, char** argv )
{
    int *hArray = NULL;
    int *dArray = NULL;

    hArray = ( int* )malloc( 10*sizeof(int) );
    CUDA_CHECK_RETURN( cudaMalloc( (void**)&dArray, 10*sizeof(int) ) );

    for( int i = 0; i < 10; i++ )
        hArray[i] = i+1;
    
    CUDA_CHECK_RETURN( cudaMemcpy( dArray, hArray, 10*sizeof(int), cudaMemcpyHostToDevice ) );

    Kernel<<<1,1>>>( dArray );

    CUDA_CHECK_RETURN( cudaMemcpy( hArray, dArray, 10*sizeof(int), cudaMemcpyDeviceToHost ) );
  
    free( hArray );
    CUDA_CHECK_RETURN( cudaFree( dArray ) );

    return 0;
}

I execute the kernel with one thread block and one thread per block. Building and running the program is no problem and I get the expected results. But if the program is tested with cuda-memcheck, it terminates the kernel and the following log appears:

Error unspecified launch failure at line 49 in file ../CuTest.cu
========= Invalid __global__ read of size 4
=========     at 0x00000078 in /home/strautz/Develop/Software/CuTest/Debug/../CuTest.cu:14:Function(int const *, int volatile *)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x01000000 is out of bounds
=========     Device Frame:/home/strautz/Develop/Software/CuTest/Debug/../CuTest.cu:25:Kernel(int*) (Kernel(int*) : 0xd0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x34b) [0x55d0b]
=========     Host Frame:/usr/lib/libcudart.so.5.0 [0x8f6a]
=========
========= Program hit error 4 on CUDA API call to cudaMemcpy 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x24e129]
=========     Host Frame:/usr/lib/libcudart.so.5.0 (cudaMemcpy + 0x2bc) [0x3772c]
=========     Host Frame:[0x5400000]
=========
========= ERROR SUMMARY: 2 errors
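
Following the suggestion from the comments below, a synchronization check right after the launch surfaces the failure immediately instead of at the next cudaMemcpy; a sketch using the macro above:

    Kernel<<<1,1>>>( dArray );

    // reports cudaErrorLaunchFailure here if the kernel aborted
    CUDA_CHECK_RETURN( cudaDeviceSynchronize() );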

Does the shared memory have to be aligned? Do I have to do something else, or can the error be ignored? (I don't think so.)

  • It seems your memcpy has a problem, too. How do you allocate the device memory and launch the kernel? It would be perfect if you showed a complete .cu file. – kangshiyin Jan 14 '13 at 16:59
  • My guess is you are doing no [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). – Robert Crovella Jan 14 '13 at 17:32
  • I updated my post once more with the whole program. But I'm not at my work PC, and I don't even have CUDA installed on this PC, so the syntax could be wrong (I can't check it at the moment). This is my little test program - I will update the post once more with the right syntax tomorrow, when I'm back at my work PC. – hubs Jan 14 '13 at 19:14
  • In the test program I don't do error checking, right - I will add it tomorrow. But in the "big" program I do. It may be that I forgot it in some places; I hope not. I'm also going to check this tomorrow. – hubs Jan 14 '13 at 19:18
  • Other than a few syntax errors (e.g. cudaMemCpy -> cudaMemcpy), the code you have entered into your question compiles and runs without error, and cuda-memcheck reports zero errors. So I don't think this sample represents whatever problem you are having. I'm very suspicious that in the code producing the problem, the cudaMalloc associated with the global array you are passing to the kernel is not done correctly, and you are not checking it for errors. – Robert Crovella Jan 14 '13 at 19:45
  • Yes, sorry for the syntax errors I posted, but I don't have any CUDA installation on this PC; tomorrow on my PC at work I will. Did you build it under Linux with Nsight Eclipse or under Windows with Visual Studio? I have only tested it under Linux so far, and I get this memcheck error for this short program, too. – hubs Jan 14 '13 at 19:56
  • The whole .cu file of the test is now posted in the original post, but as I said, I still get the error with cuda-memcheck, and since I added CUDA error checking I get an unspecified launch failure at the second cudaMemcpy. Do you have any idea why you can run it without an error and I can't? I'm trying to extract some code from the original program. This is going to be difficult, because it's a big program and I'm not allowed to show all of the code. – hubs Jan 15 '13 at 07:42
  • @hubs you may want to use @username to inform others you've commented to them. Since you got an `unspecific launch failure`, I suggest you dig from there. It seems your kernel does not really run. Add `CUDA_CHECK_RETURN(cudaDeviceSynchronize());` just after the kernel launch. Try to profile the program and check the timeline. Show us your launch settings including your real grid/block number. It's not `<<<1,1>>>` in your buggy code, right? – kangshiyin Jan 17 '13 at 00:58
  • @Eric: sorry, I'm new here, but thanks a lot for your tips and your help. I can start the kernel with any settings and I'll get this error: the memcheck out-of-bounds error and, with error checking, `code=4(cudaErrorLaunchFailure) "cudaDeviceSynchronize()"`. But I only get this if I don't use a wrapper. – hubs Jan 17 '13 at 08:26
  • @Eric: found the failure. As described [here](http://stackoverflow.com/questions/14357827/a-bug-of-cuda-memcheck), it's a driver problem. I updated to the current driver and now everything is fine. – hubs Jan 17 '13 at 08:47

2 Answers


As described here, it was just a driver problem. After I updated to the current driver, everything is working fine.

– hubs

See the CUDA 5.0 installation file /usr/local/cuda-5.0/samples/6_Advanced/reduction/doc/reduction.ppt.

sdata is a local variable of the device function warpReduce(). It stores the address of the shared memory. The shared memory can be read and written through that address within the device function. The final reduction result is then read from shared memory outside warpReduce():

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    if (blockSize >=  64) sdata[tid] += sdata[tid + 32];
    if (blockSize >=  32) sdata[tid] += sdata[tid + 16];
    if (blockSize >=  16) sdata[tid] += sdata[tid +  8];
    if (blockSize >=   8) sdata[tid] += sdata[tid +  4];
    if (blockSize >=   4) sdata[tid] += sdata[tid +  2];
    if (blockSize >=   2) sdata[tid] += sdata[tid +  1];
}
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize];  i += gridSize;  }
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32) warpReduce<blockSize>(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
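
For reference, a sketch of how reduce6 might be launched (block/grid sizes here are illustrative assumptions, not from the sample). Since sdata is declared extern, the dynamic shared memory size must be passed as the third launch parameter, and the template argument must match the actual block size:

const unsigned int threads = 256;  // must equal the blockSize template argument
const unsigned int blocks  = 64;   // illustrative grid size
unsigned int n = 1 << 20;          // number of input elements (assumed)

// d_idata and d_odata are assumed to be device allocations of sufficient size
reduce6<threads><<<blocks, threads, threads * sizeof(int)>>>(d_idata, d_odata, n);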
– kangshiyin
  • Unfortunately, in the samples I've got installed (5.0.35 production release), the reduction sample no longer contains the device function warpReduce - it's also changed in the version from the NVIDIA developer zone. The reduction is done directly in the kernel. So I made a simple example to test it (I will post it afterwards) and I still got some of my initial problems. – hubs Jan 14 '13 at 15:20
  • I have to edit my initial post, because it was my first post ever. – hubs Jan 14 '13 at 15:38
  • I've found shared memory used as a value returned from device functions in the scan sample. I'll try to find out what the difference to my program is, because with the SDK sample I don't get errors with memcheck. – hubs Jan 15 '13 at 11:56
  • OK, it seems that you have to build a wrapper to avoid this error. But I don't really know why... – hubs Jan 16 '13 at 15:07
  • @hubs the code in your answer uses a wrapper, but mine does not. It's more likely to be a bug in memcheck, judging from your code. – kangshiyin Jan 16 '13 at 15:10