
All tutorials and introductory material for GPGPU/CUDA use flat arrays, but I'm trying to port a piece of code that uses somewhat more sophisticated objects than an array.

I have a 3-dimensional std::vector whose data I want on the GPU. What strategies are there to get it there?

I can think of one for now:

  1. Copy the vector's data on the host to a simpler structure such as a flat array. However, this seems wasteful because (1) I have to copy the data on the host before sending it to the GPU, and (2) I have to allocate a multi-dimensional array whose dimensions are the maximum element count of any of the sub-vectors.

For example, with a 2D vector, imagine {{1, 2, 3, 4, ..., 1000}, {1}}. In host memory these are roughly 1001 allocated items, whereas if I were to copy this to a rectangular 2-dimensional array, I would have to allocate 2 × 1000 = 2000 elements.

Are there better strategies?

hbogert
  • You don't have to allocate a 3D array of max dimensions. You only need a 1-D (i.e. flat) array of length equal to the number of *elements* in your 3D std::vector. And you need a companion (1D, probably) array of the starting points of each sub-vector. The net storage requirements on the GPU probably end up being similar to the storage requirements of the 3D std::vector on the host. There are many ways to refactor data to suit the GPU. This question is quite broad. – Robert Crovella Apr 30 '14 at 15:41
  • Thank you Robert; the first part of your comment might well have been an (accepted) answer. But you are thus confirming that refactoring the data structures on the host is needed? I would've hoped there was some CUDA function that would take objects (class or struct) and efficiently deep-copy the actual values to a device without copying in host memory first. – hbogert May 01 '14 at 05:31
  • The closest I can suggest to automatic deep copying would be Unified Memory. But that still involves *some* code refactoring, has some specific requirements, and isn't necessarily a high-performance approach at this time. I posted an answer with some suggestions of things to investigate. Refactoring of `std::vector` may not matter much, but many types of data structures *should* be refactored to yield higher performance on the GPU. – Robert Crovella May 01 '14 at 13:29

1 Answer


There are many methodologies for refactoring data to suit GPU computation. One challenge is copying data between host and device; another is representing data (and designing algorithms) on the GPU so as to make efficient use of memory bandwidth. I'll highlight 3 general approaches, focusing on ease of copying data between host and device.

  1. Since you mention std::vector, you might take a look at thrust, which has vector container representations that are compatible with GPU computing. However, thrust won't conveniently handle vectors of vectors AFAIK, which is what I interpret your "3D std::vector" nomenclature to mean, so some (non-trivial) refactoring will still be involved. And thrust still doesn't let you use a vector directly in ordinary CUDA device code, although the data it contains is usable.

  2. You could manually refactor the vector of vectors into flat (1-D) arrays. You'll need one array for the data elements (length = total number of elements contained in your "3D" std::vector), plus one or more additional (1-D) arrays to store the start (and, implicitly, the end) points of each individual sub-vector. Yes, folks will say this is inefficient because it involves indirection or pointer chasing, but so does the use of vector containers on the host. I would suggest that getting your algorithm working first is more important than worrying about one level of indirection in some aspects of your data access.

  3. As you point out, the "deep-copy" issue with CUDA can be a tedious one. It's pretty new, but you might want to take a look at Unified Memory, which is available on 64-bit Windows and Linux platforms under CUDA 6, with a Kepler (cc 3.0) or newer GPU. With C++ especially, UM can be very powerful because we can overload operators like new under the hood and provide almost seamless usage of UM for shared host/device allocations.

Robert Crovella
  • In my case, option 2 is the only viable one; it's a big application in which only one very expensive operation will be offloaded, so I can't convert everything to thrust and hope it will work for everyone else. Option 3 has come to my attention as well, but I only have Fermi hardware available. Ergo, I'll have to make a flatten method – hbogert May 01 '14 at 14:00