How to implement custom type that allocates in the device heap from host in CUDA C++?

Question

What I want to achieve is: implement a custom type that that can be used from de __device__, it needs to allocate memory on the heap, and I'd like to allocate it from the __host__. I have this:

#include <iostream>

template <typename T>
class Array{
    T *arr;
    size_t* _size;
public:
    
    /**
     * Move constructor; this allows the array to be passed
     * as a return value from a function sapping the pointers
     * and keeping the allocated data on the heap.
     */
    __device__
    Array(Array&& other){
        arr = other.arr;
        other.arr = NULL;
    }


    __host__
    Array(T* other_arr, size_t size){
        cudaMalloc(&_size, sizeof(size_t));
        cudaMalloc(&arr, sizeof(T) * (size + 1));

        cudaMemcpy(_size, &size, sizeof(size_t), cudaMemcpyHostToDevice);
        cudaMemcpy(arr, other_arr, sizeof(T) * size, cudaMemcpyHostToDevice);
    }

    /**
     * Desctructor; dealocate heap
     */
    __host__
    ~Array(){
        cudaFree(_size);
        cudaFree(arr);
    }

    /**
     * Write access to the array
     * @param i index
     * @return reference to i-th element
     */
    __device__
    T &operator[](size_t i){
        if (i > *_size)
            return arr[*_size];
        return arr[i];
    }
    
    /**
     * Read only access to the array
     * @param i index
     * @return reference to i-th element
     */
    __device__
    const T &operator[](size_t i) const {
        if (i > *_size)
            return arr[*_size];
        return arr[i];
    }

    /** 
     * Get array size
     * @return array size
     */
    __device__
    size_t size() const {
        return *_size;
    }

    /** 
     * Resize array droping stored values
     */
    __device__
    void resize(size_t n){
        delete[] arr;
        *_size = n;
        arr = new T[*_size + 1];
    }
}; // class Array

/**
 * Returns the smallest element from an array
 * @param a Array
 * @return smallest element of `a`
 */
template<typename T>
__device__
T min(const Array<T>& a){
    T m = a[0];
    for(size_t i = 1; i < a.size(); i++)
        m = std::min(m, a[i]);
    return m;
}

/**
 * Returns the larges element from an array
 * @param a Array
 * @return larges element of `a`
 */
template<typename T>
__device__
T max(const Array<T>& a){
    T m = a[0];
    for(size_t i = 1; i < a.size(); i++)
        m = std::max(m, a[i]);
    return m;
}

__global__ void k_sum_array(Array<int>* arr, int* s){
    
    *s = 0;
    for(size_t i = 0; i < arr->size(); i++)
        *s += arr->operator[](i);
}

int main(){
    int a[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    size_t size = 10;

    Array<int> arr(a, size);
    int* s;
    cudaMalloc(&s, sizeof(int));

    k_sum_array<<<1, 1>>>(&arr, s);

    // the next line was causing segfault because you cant access device data from the host (CUDA 101)
    // std::cout << *s << std::endl;

    int hs;
    cudaMemcpy(&hs, s, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << hs << std::endl;
    

    return 0;
}

It doesn't give the expected result. Any thought on how to achieve what I want?

You cannot allocate device heap memory from the host. The only way to do it is to use in-kernel `new` or in-kernel `malloc` (or in-kernel `cudaMalloc`). See [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations). Perhaps you didn't actually mean "device heap". — Robert Crovella, Aug 25 '22 at 14:46
The proximal cause of the seg fault is here: `std::cout << *s << std::endl;`, you cannot access device data in host code that way. That's a fairly basic CUDA principle. There are other issues with your code as well. — Robert Crovella, Aug 25 '22 at 14:59
@RobertCrovella, yes I forgot about that last thing, I fixed it and it no longer seg faults, still not giving the desired result. About the first comment, doesn't `cudaMalloc` in the host allocate memory in the device heap? — MaxWell, Aug 25 '22 at 15:07
@RobertCrovella can you tell me about the other issues in the code? Please. — MaxWell, Aug 25 '22 at 15:08
no, `cudaMalloc` doesn't allocate memory in the device heap. Please follow and read the link I provided. — Robert Crovella, Aug 25 '22 at 15:49

Robert Crovella · Accepted Answer · 2022-08-26T15:33:01.900

In CUDA, this won't work:

cudaMalloc(&s, sizeof(int));
...
std::cout << *s << std::endl;

You cannot access device memory from host code.

This is also problematic:

k_sum_array<<<1, 1>>>(&arr, s);
                      ^

The address of arr is a pointer to host memory. That is going to be useless in CUDA device code. Let's recap. In CUDA:

Host code cannot directly access (ordinary) device memory.
Device code cannot directly access (ordinary) host memory.

The first issue is fairly straightforward to fix. You have already edited your post to do that.

The second issue requires some refactoring, and I'm sure there are several ways to proceed at this point:

Use pass-by-pointer correctly (copy the object to the device first)
Use pass-by-value
Use managed memory
probably other methods

The thing I observe is that there is basically no need to pass arr by pointer, passing by value should be fine. CUDA handles that properly. But it can be somewhat involved.

If we convert to pass-by-value, then we need to refactor the device code accordingly. Additionally, pass-by-value in C++, when passing objects, creates an implicit object creation/destruction sequence around the function call, to support pass-by-value. This complicates operations around the kernel call. The object destructor will get called implicitly, and this sometimes trips people up. A simple solution is not to call cudaFree in the destructor. In addition, your object copy constructor is wrong (doesn't copy _size) and we will need an additional form of the copy-constructor due to the kernel call pass-by-value mechanism.

So in the interest of simplicity, I'll show a refactoring using pass-by-pointer.

The following code makes the change to provide the object as a proper entity in device memory. The only changes are in main around the handling of arr:

$ cat t2104.cu
#include <iostream>

template <typename T>
class Array{
    T *arr;
    size_t* _size;
public:

    /**
     * Move constructor; this allows the array to be passed
     * as a return value from a function sapping the pointers
     * and keeping the allocated data on the heap.
     */
    __device__
    Array(Array&& other){
        arr = other.arr;
        other.arr = NULL;
    }


    __host__
    Array(T* other_arr, size_t size){
        cudaMalloc(&_size, sizeof(size_t));
        cudaMalloc(&arr, sizeof(T) * (size + 1));

        cudaMemcpy(_size, &size, sizeof(size_t), cudaMemcpyHostToDevice);
        cudaMemcpy(arr, other_arr, sizeof(T) * size, cudaMemcpyHostToDevice);
    }

    /**
     * Desctructor; dealocate heap
     */
    __host__
    ~Array(){
        cudaFree(_size);
        cudaFree(arr);
    }

    /**
     * Write access to the array
     * @param i index
     * @return reference to i-th element
     */
    __device__
    T &operator[](size_t i){
        if (i > *_size)
            return arr[*_size];
        return arr[i];
    }

    /**
     * Read only access to the array
     * @param i index
     * @return reference to i-th element
     */
    __device__
    const T &operator[](size_t i) const {
        if (i > *_size)
            return arr[*_size];
        return arr[i];
    }

    /**
     * Get array size
     * @return array size
     */
    __device__
    size_t size() const {
        return *_size;
    }

    /**
     * Resize array droping stored values
     */
    __device__
    void resize(size_t n){
        delete[] arr;
        *_size = n;
        arr = new T[*_size + 1];
    }
}; // class Array

/**
 * Returns the smallest element from an array
 * @param a Array
 * @return smallest element of `a`
 */
template<typename T>
__device__
T min(const Array<T>& a){
    T m = a[0];
    for(size_t i = 1; i < a.size(); i++)
        m = std::min(m, a[i]);
    return m;
}

/**
 * Returns the larges element from an array
 * @param a Array
 * @return larges element of `a`
 */
template<typename T>
__device__
T max(const Array<T>& a){
    T m = a[0];
    for(size_t i = 1; i < a.size(); i++)
        m = std::max(m, a[i]);
    return m;
}

__global__ void k_sum_array(Array<int>* arr, int* s){

    *s = 0;
    for(size_t i = 0; i < arr->size(); i++)
        *s += arr->operator[](i);
}

int main(){
    int a[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    size_t size = 10;

    Array<int> arr(a, size);
    Array<int> *d_arr;
    cudaMalloc(&d_arr, sizeof(Array<int>));
    cudaMemcpy(d_arr, &arr, sizeof(Array<int>), cudaMemcpyHostToDevice);
    int* s;
    cudaMalloc(&s, sizeof(int));

    k_sum_array<<<1, 1>>>(d_arr, s);

    // the next line was causing segfault because you cant access device data from the host (CUDA 101)
    // std::cout << *s << std::endl;

    int hs;
    cudaMemcpy(&hs, s, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << hs << std::endl;


    return 0;
}
$ nvcc -o t2104 t2104.cu
$ compute-sanitizer ./t2104
========= COMPUTE-SANITIZER
55
========= ERROR SUMMARY: 0 errors
$

I'm not suggesting this fixes every possible defect in your code, merely that it seems to address the proximal issue(s) and seems to return the correct answer for the test case you have actually provided. I've already indicated that I don't think your copy-constructor is right, I haven't looked at other functions like min and max, and I definitely find your handling/definition of the array _size to be quite strange, but none of that seems to be relevant to your test case.

I'm also completely ignoring your usage of the "device heap" terminology. I don't think you are using that terminology in a fashion that is consistent with how CUDA defines that but it doesn't seem to be important to the discussion of the code you presented.

This is an example of a minimal set of refactoring to get your test case to work using pass-by-value. (Most changes are covered above.) Same disclaimers as above:

$ cat t2103.cu
#include <iostream>

template <typename T>
class Array{
    T *arr;
    size_t* _size;
public:

    /**
     * Move constructor; this allows the array to be passed
     * as a return value from a function sapping the pointers
     * and keeping the allocated data on the heap.
     */
   __device__
    Array(Array&& other){
        arr = other.arr;
        _size = other._size;
    //    other.arr = NULL;
    }
   __host__
    Array(Array& other){
        arr = other.arr;
        _size = other._size;
    //    other.arr = NULL;
    }


    __host__
    Array(T* other_arr, size_t size){
        cudaMalloc(&_size, sizeof(size_t));
        cudaMalloc(&arr, sizeof(T) * (size + 1));

        cudaMemcpy(_size, &size, sizeof(size_t), cudaMemcpyHostToDevice);
        cudaMemcpy(arr, other_arr, sizeof(T) * size, cudaMemcpyHostToDevice);
    }

    /**
     * Desctructor; dealocate heap
     */
    __host__
    ~Array(){
       // if (_size != NULL) cudaFree(_size); _size = NULL;
       // if (arr != NULL) cudaFree(arr); arr = NULL;
    }

    /**
     * Write access to the array
     * @param i index
     * @return reference to i-th element
     */
    __device__
    T &operator[](size_t i){
        if (i > *_size)
            return arr[*_size];
        return arr[i];
    }

    /**
     * Read only access to the array
     * @param i index
     * @return reference to i-th element
     */
    __device__
    const T &operator[](size_t i) const {
        if (i > *_size)
            return arr[*_size];
        return arr[i];
    }

    /**
     * Get array size
     * @return array size
     */
    __device__
    size_t size() const {
        return *_size;
    }

    /**
     * Resize array droping stored values
     */
    __device__
    void resize(size_t n){
        delete[] arr;
        *_size = n;
        arr = new T[*_size + 1];
    }
}; // class Array

/**
 * Returns the smallest element from an array
 * @param a Array
 * @return smallest element of `a`
 */
template<typename T>
__device__
T min(const Array<T>& a){
    T m = a[0];
    for(size_t i = 1; i < a.size(); i++)
        m = std::min(m, a[i]);
    return m;
}

/**
 * Returns the larges element from an array
 * @param a Array
 * @return larges element of `a`
 */
template<typename T>
__device__
T max(const Array<T>& a){
    T m = a[0];
    for(size_t i = 1; i < a.size(); i++)
        m = std::max(m, a[i]);
    return m;
}

__global__ void k_sum_array(Array<int> arr, int* s){

    *s = 0;
    for(size_t i = 0; i < arr.size(); i++)
        *s += arr[i];
}

int main(){
    int a[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    size_t size = 10;

    Array<int> arr(a, size);
    int* s;
    cudaMalloc(&s, sizeof(int));

    k_sum_array<<<1, 1>>>(arr, s);
   // std::cout << *s << std::endl;
    int s_h;
    cudaMemcpy(&s_h, s, sizeof(s[0]), cudaMemcpyDeviceToHost);
    std::cout << s_h << std::endl;
    return 0;
}
$ nvcc -o t2103 t2103.cu
$ compute-sanitizer ./t2103
========= COMPUTE-SANITIZER
55
========= ERROR SUMMARY: 0 errors
$

Thank you for your thorough answer. I'm a noob with CUDA so maybe I'm using the wrong terminology, I apologize. Here I'm trying to translate a C++ code into CUDA and I guess some things will need to be changed. My goal in this example is to send the data in an Array to the device so that it can be accessed from all the threads, so maybe I'm not using the "device heap" but it does what I need it to do. — MaxWell, Aug 25 '22 at 22:17
@MaxWell There are several ways of getting rid of the pesky C API calls for memory management in C++ code. There is [Thrust](https://github.com/NVIDIA/thrust) vectors, but Thrust might be a bit too broad/high level for some things, so I would recommend [RMM](https://github.com/rapidsai/rmm). — paleonix, Aug 26 '22 at 19:48
Why is there no memory leaks in the last example, the copy-by-value if there is no cudaFree called anywhere? — MaxWell, Aug 29 '22 at 02:28
all allocations are freed automatically on application termination. the pass-by-value mechanism doesn't create any new allocations. — Robert Crovella, Aug 29 '22 at 05:10

How to implement custom type that allocates in the __device__ heap from __host__ in CUDA C++?

1 Answers1

How to implement custom type that allocates in the device heap from host in CUDA C++?