I learned that std::vector is a nice wrapper around raw arrays in C++, so I started to use it for managing host data in my CUDA app [1]. Since allocating and copying things by hand makes the code more complex and less readable, I thought about extending std::vector. Since I'm not very experienced, I would like to know what you think about it: in particular, whether it's done correctly (e.g. the destructor of std::vector is called implicitly, right?) and whether you consider it a good idea.

I wrote a small example illustrating this:

#include <vector>
#include <cuda_runtime.h>

#include <cstdio>

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err) {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Wrapper around CUDA memory
template<class T>
class UniversalVector: public std::vector<T>
{
    T* devicePtr_;
    bool allocated;

public:

    // Constructor
    UniversalVector(unsigned int length)
        :std::vector<T>(length),
         devicePtr_(0),      // no device allocation yet
         allocated(false)
    {}

    // Destructor
    ~UniversalVector()
     {
        if(allocated)
            cudaFree(devicePtr_);
     }

    cudaError_t allocateDevice()
    {
        // A previous allocation must be released with cudaFree, not free()
        if(allocated) cudaFree(devicePtr_);
        cudaError_t err = 
            cudaMalloc((void**)&devicePtr_, sizeof(T) * this->size());
        allocated = true;
        return err;
    }

    cudaError_t loadToDevice()
    {
        return cudaMemcpy(devicePtr_, &(*this)[0], sizeof(T) * this->size(),
            cudaMemcpyHostToDevice);
    }

    cudaError_t loadFromDevice()
    {
        return cudaMemcpy(&(*this)[0], devicePtr_, sizeof(T) * this->size(),
            cudaMemcpyDeviceToHost);
    }

    // Accessors

    inline T* devicePtr() {
        return devicePtr_;
    }

};

__global__ void kernel(int* a)
{
    int i = threadIdx.x;
    printf("%i\n", a[i]);
}

int main()
{
    UniversalVector<int> vec(3);
    vec.at(0) = 1;
    vec.at(1) = 2;
    vec.at(2) = 3;

    vec.allocateDevice();
    vec.loadToDevice();

    kernel<<<1, 3>>>(vec.devicePtr());

    // Synchronize so the device-side printf output is flushed and any
    // kernel execution error is reported before the program exits
    cudaDeviceSynchronize();

    checkCUDAError("kernel execution");

    return 0;
}

[1] In CUDA a distinction is made between host and device memory: host memory is the memory accessible by the CPU, and device memory is the memory on the GPU. The programmer has to move data from the host to the device and back.

Nils
  • In general, it is a bad idea to extend STL containers. You would be better off in most cases by using composition and providing a simple façade for the methods in the STL container that you actually use. – David Rodríguez - dribeas May 24 '11 at 18:03
  • @Heandel No, I just want the code on the host side to be a bit cleaner. Are dynamically sized arrays on the device even possible? – Nils May 24 '11 at 18:06
  • @David Thanks for the comment, but what exactly could cause a problem? I was just too lazy to write a façade which wraps [], .at() and resize(). – Nils May 24 '11 at 18:07

4 Answers

You might want to have a look at Thrust. It provides STL-like containers for CUDA code.

Steve Fallows
  • Did I ask for a library? I know about Thrust, and it is too heavy for my needs: it completely abstracts memory allocation and copying on the device, which is not what I intend to do. All I want is cleaner code which is less error-prone. – Nils May 24 '11 at 17:58
  • Well, this is also supposed to be a host-side abstraction. – Nils May 24 '11 at 18:03
  • You don't have to use all of Thrust. Just #include <thrust/host_vector.h> and <thrust/device_vector.h> and use only those. They are work-alikes for std::vector with the extensions you are looking for: you can copy from host to device just with operator= (see the sketch after this list). – harrism May 25 '11 at 03:18
  • Seconding the recommendation of Thrust. Its header-only nature means it's not heavy at all. It also nicely handles host/device locality etc. like you wanted. No sense in reinventing the wheel. The only downside to Thrust is that it doesn't seem to take advantage of the unified memory model introduced in CUDA 4.0. That should change with the next release though. – peakxu May 25 '11 at 12:11
  • Just figured out how to use Thrust for sorting and counting; it's pretty easy :) I think I might rely more on Thrust in the future. – Nils Jun 05 '11 at 12:22
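
To illustrate the comments above, here is a minimal sketch of the Thrust route (assuming only the two headers harrism mentions); both transfers are plain assignments:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <cstdio>

__global__ void kernel(int* a)
{
    printf("%i\n", a[threadIdx.x]);
}

int main()
{
    thrust::host_vector<int> h(3);
    h[0] = 1; h[1] = 2; h[2] = 3;

    // operator=/copy construction performs the host-to-device copy
    thrust::device_vector<int> d = h;

    // a raw device pointer can still be passed to ordinary kernels
    kernel<<<1, 3>>>(thrust::raw_pointer_cast(&d[0]));
    cudaDeviceSynchronize();

    // copying back to the host is again just an assignment
    h = d;

    return 0;
}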

The biggest problem I see with this is that it doesn't really help manage the GPU side of things very much, and it obfuscates a number of very important pieces of information in the process.

While the container class contains information about whether the device pointer has been allocated, there is no way of knowing whether the contents of the host container have been copied to the GPU memory it holds, or whether the GPU memory has been copied back to the host. As a result you will have to call the loadToDevice() and loadFromDevice() methods every time you wish to use the container in either host or device code. That probably means unnecessary PCI-e memory transfers at least some of the time. And because you have chosen to wrap only the synchronous CUDA memory copy routines, there will be host blocking every time you do this.

Ultimately I don't see much net gain in this idea over a well designed set of helper routines which abstract away the ugliest bits of the CUDA APIs and operate on standard STL types.
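
To make the blocking point concrete: a hypothetical free helper (the name is made up) could wrap cudaMemcpyAsync on a caller-supplied stream instead. Note that a plain std::vector lives in pageable memory, so the copy only truly overlaps host work when the buffer is page-locked (cudaMallocHost / cudaHostRegister):

#include <vector>
#include <cuda_runtime.h>

// Asynchronous host-to-device copy on a user-supplied stream.
// With pageable host memory (as in std::vector) the transfer is
// staged and largely synchronous; page-locked memory is required
// for real overlap with host computation.
template<class T>
cudaError_t loadToDeviceAsync(T* devPtr, const std::vector<T>& host,
                              cudaStream_t stream)
{
    return cudaMemcpyAsync(devPtr, &host[0], sizeof(T) * host.size(),
                           cudaMemcpyHostToDevice, stream);
}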

talonmies

I'll extend David Rodríguez - dribeas' comment a bit:

The question of why you should prefer composition over inheritance (even though it requires additional façade work) has been asked and answered multiple times. A good answer is this one: Prefer composition over inheritance?

The determining factor is the interface: Do you want all or some methods of the underlying class?

In your case, the std::vector methods that modify the size of the vector (resize, push_back, pop_back, erase, insert, etc.) are likely to cause mayhem if called between the calls to loadToDevice and loadFromDevice.

In your question you state that you need a wrapper around raw arrays. Those are of fixed size! Therefore you might very well use a std::vector internally in your wrapper class (composition!), but you need to hide away all of its dynamic sizing machinery, as in the sketch below.
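
A minimal sketch of what that composition could look like (the class name is made up; error checking and copy control are omitted for brevity):

#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

template<class T>
class DeviceBackedVector
{
    std::vector<T> host_;   // composition: the vector is a private member
    T* devicePtr_;

public:
    explicit DeviceBackedVector(std::size_t length)
        : host_(length), devicePtr_(0)
    {
        cudaMalloc((void**)&devicePtr_, sizeof(T) * host_.size());
    }

    ~DeviceBackedVector() { cudaFree(devicePtr_); }

    // Forward only the size-preserving parts of the interface;
    // resize, push_back, insert etc. are deliberately not exposed.
    T& operator[](std::size_t i) { return host_[i]; }
    T& at(std::size_t i)         { return host_.at(i); }
    std::size_t size() const     { return host_.size(); }

    cudaError_t loadToDevice()
    {
        return cudaMemcpy(devicePtr_, &host_[0], sizeof(T) * host_.size(),
                          cudaMemcpyHostToDevice);
    }

    cudaError_t loadFromDevice()
    {
        return cudaMemcpy(&host_[0], devicePtr_, sizeof(T) * host_.size(),
                          cudaMemcpyDeviceToHost);
    }

    T* devicePtr() { return devicePtr_; }

    // Note: copying an instance would lead to a double cudaFree;
    // real code should disable copying or implement the rule of three.
};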

Jonas Bötel

You'd be better off having functions like allocateDevice and loadToDevice as free functions rather than as members of a class inherited from std::vector. That could quite possibly save you a great deal of work when integrating other libraries/classes with your stuff. Overall it looks good.

  • Seconded. They can be templates for any container, as long as you can allocateDevice() them and copy between device and host representations (using iterators, for example; see the sketch below). – berkus May 24 '11 at 18:30
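
A minimal sketch of such a free function, following the iterator suggestion (the function name is made up; contiguous storage like std::vector is assumed):

#include <iterator>
#include <cuda_runtime.h>

// Hypothetical free function: copies the host range [first, last) to
// previously allocated device memory. The iterators are assumed to
// point into contiguous storage, e.g. a std::vector.
template<class Iterator>
cudaError_t loadToDevice(
    typename std::iterator_traits<Iterator>::value_type* devPtr,
    Iterator first, Iterator last)
{
    typedef typename std::iterator_traits<Iterator>::value_type T;
    return cudaMemcpy(devPtr, &*first, sizeof(T) * (last - first),
                      cudaMemcpyHostToDevice);
}

A call like loadToDevice(devPtr, vec.begin(), vec.end()) then works for any contiguous container, not only for classes derived from std::vector.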