
In a piece of code I came across, there was a struct for defining and using shared memory. I modified the allocation to be static and used it in a test program like the one below:

#include <stdio.h>

template<class T, uint bDim>
struct SharedMemory
{
     __device__ inline operator T *() {
        __shared__ T __smem[ bDim ];
        return (T*) (void *) __smem;
    }
     __device__ inline operator const T *() const {
        __shared__ T __smem[ bDim ];
        return (T*) (void *) __smem;
    }
};

template <uint bDim>
__global__ void myKernel() {
    SharedMemory<uint, bDim> myShared;
    myShared[ threadIdx.x ] = threadIdx.x;
    __syncthreads();
    printf("%d\tsees\t%d\tat two on the circular right.\n", threadIdx.x, myShared[ ( threadIdx.x + 2 ) & 31 ]);
}

int main() {
    myKernel<32><<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}

It works as expected. However, I have a few questions about this usage:

  1. I don't understand the syntax used in the operator overloading section of the SharedMemory struct. Is it overloading the dereference operator *? If so, how do accesses via square brackets translate into a pointer dereference? Also, why does changing the __device__ inline operator T *() { line into __device__ inline T operator *() { produce compiler errors?

  2. I wanted to ease the use of the wrapper by overloading the assignment operator or defining a member function, so that each thread updates the shared memory location corresponding to its thread index. For example, writing myShared = 47; or myShared.set( 47 ); would translate into myShared[ threadIdx.x ] = 47; behind the curtain. But I have been unsuccessful at doing this. It compiles fine, but the shared memory buffer reads all zeros (which I think is the default shared memory initialization in Debug mode). Can you please let me know where I'm going wrong? Here's my try:

    template<class T, uint bDim>
    struct SharedMemory
    {
         __device__ inline operator T*() {
            __shared__ T __smem[ bDim ];
            return (T*) (void *) __smem;
        }
         __device__ inline operator const T *() const {
            __shared__ T __smem[ bDim ];
            return (T*) (void *) __smem;
        }
        __device__ inline T& operator=( const T& __in ) {
            __shared__ T __smem[ bDim ];
            __smem[ threadIdx.x ] = __in;
            return (T&) __smem[ threadIdx.x ];
        }
        __device__ inline void set( const T __in ) {
            __shared__ T __smem[ bDim ];
            __smem[ threadIdx.x ] = __in;
        }
    
    };
    

    For the member function, the compiler gives out a warning:

    variable "__smem" was set but never used
    

Although I am aware member variables cannot be __shared__, I suspect I have a wrong assumption about the __shared__ qualifier, or that what I want to do doesn't match its characteristics. I appreciate the help.

Farzad
  • Why would you modify the memory to be declared static at operator scope? That makes absolutely no sense. – talonmies Aug 26 '15 at 20:15
  • 3
    `operator const T *()` is an [implicit-explicit user defined conversion operator (cast operator)](http://en.cppreference.com/w/cpp/language/cast_operator). Looks like you don't understand what [assignment operator](http://en.cppreference.com/w/cpp/language/as_operator) does (pick some [book on C++](http://stackoverflow.com/questions/388242/the-definitive-c-book-guide-and-list)). What you really need to overload is `operator[]`. Overall approach seems rather strange. – Ivan Aksamentov - Drop Aug 26 '15 at 20:21
  • @talonmies shared memory size needed to be allocated is known at compile time. May I know why it makes no sense? – Farzad Aug 26 '15 at 20:25
  • In the first structure definition you posted, each member has a local scope definition of __smem. Understand they don't refer to the same memory. Also understand that returning a pointer to those __smem is technically undefined behaviour (although it will accidentally work because of the way shared memory works in CUDA). You have taken what amounts to a template trick to fool the compiler and are trying to use it for something it was never intended for. That's why I say it makes no sense. – talonmies Aug 27 '15 at 05:49
  • @talonmies thanks. I think I get it now. I had been thinking that `__shared__` makes the variable keep living outside the scope (similar to `static` keyword), and again I was wrong that its re-declaration is required to be accessed. – Farzad Aug 27 '15 at 06:05
  • I think the only reason that any of this even compiles is because of the `inline` keyword. It is illegal in CUDA to specify memory types for class member variables – talonmies Aug 27 '15 at 06:43

1 Answer


It appears you had a few misunderstandings about what the __shared__ qualifier actually does in CUDA. That, combined with a rather tricky template designed to fool the compiler in the case where extern __shared__ memory is used in templated kernel instances, led you down a blind path.

If I have understood your need correctly, what you really are looking for is something like this:

template<typename T>
struct wrapper
{
    T * p;
    unsigned int tid;

    __device__ wrapper(T * _p, unsigned int _tid) : p(_p), tid(_tid) {}
    __device__ const T* operator->() const { return p + tid; }
    __device__ T& operator*() { return *(p + tid); }
    __device__ const T& operator*() const { return *(p + tid); }
};

This is a wrapper which you can use to "hide" a pointer and an offset to have "indexing" free access to the pointer, for example:

#include <cstdio>

// structure definition goes here

void __global__ kernel(float *in)
{
    __shared__ float _buff[32];
    wrapper<float> buff(&_buff[0], threadIdx.x);

    *buff = in[threadIdx.x + blockIdx.x * blockDim.x];
    __syncthreads();

    for(int i=0; (i<32) && (threadIdx.x == 0); ++i) { 
        printf("%d %d %f\n", blockIdx.x, i, _buff[i]);
    }
}

int main()
{
    float * d = new float[128];
    for(int i=0; i<128; i++) { d[i] = 1.5f + float(i); }

    float * _d;
    cudaMalloc((void **)&_d, sizeof(float) * size_t(128));
    cudaMemcpy(_d, d, sizeof(float) * size_t(128), cudaMemcpyHostToDevice);

    kernel<<<4, 32>>>(_d);
    cudaDeviceSynchronize();

    cudaFree(_d);
    delete [] d;
    cudaDeviceReset();

    return 0;
}

In the example kernel, the shared memory array _buff is wrapped with the thread index within a wrapper instance, and the operator overloads let you access a specific array element without the usual explicit indexing operation. Perhaps you can modify this to suit your needs.

talonmies