
I have a host class TestClass that has, as a member, a pointer to a class TestTable whose data is stored in an array of floats on the GPU. TestClass launches a kernel that accesses the data inside TestTable and also calls a TestClass method, GetValue().

After a lot of reading and after trying several options for which type specifiers to use for which methods and classes, as well as for how (and where) to initialize TestTable, I have the feeling that all my options eventually boil down to the same memory-access error. So my understanding of how CUDA/C++ works is probably not sufficient to implement this correctly. How should my code be properly set up?

This is the content of a minimal version of my main.cu:

#include <iostream>
#include <string>
#include <cuda_runtime.h>

#define CUDA_CHECK cuda_check(__FILE__,__LINE__)
inline void cuda_check(std::string file, int line)
{
    cudaError_t e = cudaGetLastError();
    if (e != cudaSuccess) {
        std::cout << std::endl
                  << file << ", line " << line << ": "
                  << cudaGetErrorString(e) << " (" << e << ")" << std::endl;
        exit(1);
    }
}

class TestTable {

    float* vector_;
    int num_cells_;

public:

    void Init() {
        num_cells_ = 1e4;
        cudaMallocManaged(&vector_, num_cells_*sizeof(float));
        CUDA_CHECK;
    }

    void Free() {
        cudaFree(vector_);
    }

    __device__
    bool UpdateValue(int global_index, float val) {
        int index = global_index % num_cells_;
        vector_[index] = val;
        return false;
    }

};

class TestClass {

private:

    float value_;
    TestTable* test_table_;

public:

    TestClass() : value_(1.) {
        // test_table_ = new TestTable;
        cudaMallocManaged(&test_table_, sizeof(TestTable));
        test_table_->Init();
        CUDA_CHECK;
    }

    ~TestClass() {
        test_table_->Free();
        cudaFree(test_table_);
        CUDA_CHECK;
    }

    __host__ __device__
    float GetValue() {
        return value_;
    }

    __host__
    void RunKernel();

};

__global__
void test_kernel(TestClass* test_class, TestTable* test_table) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < 1e6; i += stride) {
        const float val = test_class->GetValue();
        test_table->UpdateValue(i, val);
    }
}

__host__
void TestClass::RunKernel() {
    test_kernel<<<1,1>>>(this, test_table_);
    cudaDeviceSynchronize(); CUDA_CHECK;
}

int main(int argc, char *argv[]) {

    TestClass* test_class = new TestClass();
    std::cout << "TestClass successfully constructed" << std::endl;

    test_class->RunKernel();
    std::cout << "Kernel successfully run" << std::endl;

    delete test_class;
    std::cout << "TestClass successfully destroyed" << std::endl;

    return 0;
}

The error I get is line 88: an illegal memory access was encountered (700).

I assume the error lies in one of these issues:

  • TestTable is never created using new, which is probably bad. However, uncommenting test_table_ = new TestTable; in the TestClass constructor does not resolve the issue.
  • GetValue() in test_kernel does not return a valid float variable. If I replace it by an arbitrary float, e.g. 1.f, the program runs without errors. However, in the real (not minimal) version of my code, GetValue() does a bunch of computations which happen at different points in the code base, so hard-coding is not an option there.
  • I never copy TestClass to the GPU, but call one of its member functions from the kernel. I see that this must cause trouble, but I do not find it intuitive to know where and how to copy it. If I only call GetValue() in the kernel without reusing its result, there is no error, so it seems that my program can call GetValue() without the class being copied to the GPU.

Any help is highly appreciated!

talvi

1 Answer


The problem here has to do with how you allocate for TestClass:

TestClass* test_class = new TestClass();

test_class is now an ordinary pointer to host memory. If you have any intent of using that pointer in device code:

void TestClass::RunKernel() {
    test_kernel<<<1,1>>>(this, test_table_);
                         ^^^^

and:

void test_kernel(TestClass* test_class, TestTable* test_table) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < 1e6; i += stride) {
        const float val = test_class->GetValue();
                          ^^^^^^^^^^

that won't work. In CUDA, dereferencing a host pointer in device code is generally a fundamental problem.

We can fix this by using placement new with a managed allocation for the top-level class:

//TestClass* test_class = new TestClass();
TestClass* test_class;
cudaMallocManaged(&test_class, sizeof(TestClass));
new(test_class) TestClass();

When we do so, it's necessary to also change the deallocation. And, as indicated in the comment, you should also make sure the destructor is called before the memory is freed:

// delete test_class;
test_class->~TestClass();
cudaFree(test_class);

When I make those changes, your code runs without runtime error for me.

Robert Crovella
  • You should also do `test_class->~TestClass();` before `cudaFree` in case it has a destructor. – user253751 Dec 16 '20 at 16:38
  • Thank you so much for this detailed explanation! It totally makes sense, I will implement this in my real code now and hope it also works there. Do I understand it correctly that the `test_class` pointer can also be used in host code, or is that only true for the particular case of calling the destructor? – talvi Dec 17 '20 at 13:05
  • A pointer allocated using a managed allocator should be usable in both host and device code. Having said that, if you are using that pointer to invoke class methods, you should make sure that is sensible usage and such methods are properly decorated with `__host__` `__device__`. – Robert Crovella Dec 17 '20 at 14:45