
I run into this issue over and over in CUDA. I have, for a set of elements, done some GPU calculation that produces, for each element, a value with a linear meaning (for instance, a size in memory):

element_sizes = [ 10, 100, 23, 45 ]

And now, for the next stage of GPU calculation, I need the following values:

memory_size = sum(element_sizes)
memory_offsets = [ 0, 10, 110, 133 ]

I can calculate memory_size at 80 GB/s on my GPU using the reduction code available from NVIDIA. However, I can't use this code, as its branching reduction technique does not also build the memory offsets array. I have tried many things, but the simplest and fastest approach I have found is to copy element_sizes to the host and calculate the offsets with a serial for loop:

// copy sizes back to the host, scan serially, then copy offsets to the device
// (num_elements is the element count; error checking omitted)
int *host_element_sizes = (int *)malloc(num_elements * sizeof(int));
cudaMemcpy(host_element_sizes, element_sizes,
           num_elements * sizeof(int), cudaMemcpyDeviceToHost);

int *host_offsets = (int *)malloc(num_elements * sizeof(int));

int total_size = 0;
for(int i = 0; i < num_elements; i++){
    host_offsets[i] = total_size;           // exclusive running sum
    total_size += host_element_sizes[i];
}

int *device_offsets;
cudaMalloc(&device_offsets, num_elements * sizeof(int));
cudaMemcpy(device_offsets, host_offsets,
           num_elements * sizeof(int), cudaMemcpyHostToDevice);

However, I have done this many times now, and it is starting to become a bottleneck. This seems like a typical problem, but I have found no work-around.

What is the expected way for a CUDA programmer to solve this problem?

  • @RobertCrovella Ok will check it out and delete or mark as duplicate. In transit atm – Chris Nov 06 '18 at 23:27
  • @RobertCrovella and thank you! This has been a headache. – Chris Nov 06 '18 at 23:27
  • @RobertCrovella in my case it is necessary to preserve order. It looks like generating a solution using this method would require an index list, as I need to know which element offset is in which location. – Chris Nov 07 '18 at 11:43
  • If you want to preserve order, I would compute everything using a prefix sum on the `element_sizes` array. The prefix sum itself will give you the `memory_offsets` array, and the last value of the prefix sum will be the sum reduction i.e. `memory_size`. You haven't really shown how your `element_sizes` array is generated, so I'm not sure I can give you a precise example. But if your `element_sizes` array is in GPU global memory, producing a prefix sum on it is a trivial operation - a library call from thrust for example. – Robert Crovella Nov 07 '18 at 14:44
  • This is a great question to have on SO: not everyone knows what a prefix sum is. I've edited the question title to add some key words ("all partial sums") to help others find the question. – harrism Nov 08 '18 at 00:29

1 Answer


I think the algorithm you are looking for is a prefix sum. A prefix sum on a vector produces another vector containing the cumulative sums of the input vector. A prefix sum comes in at least two variants: an exclusive scan, where the sum at position i excludes the input element at position i, and an inclusive scan, where it is included. Conceptually these are similar. For example, on [ 10, 100, 23, 45 ], an exclusive scan produces [ 0, 10, 110, 133 ] while an inclusive scan produces [ 10, 110, 133, 178 ].

If your element_sizes vector has been deposited in GPU global memory (which appears to be the case based on your pseudocode), then there are library functions that run on the GPU which you can call at that point to produce the memory_offsets vector, and the memory_size value can be trivially obtained from the last value in the result, with a slight variation depending on whether you are doing an inclusive scan or an exclusive scan.

Here's a trivial worked example using thrust:

$ cat t319.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <iostream>



int main(){

  const int element_sizes[] = { 10, 100, 23, 45 };
  const int ds = sizeof(element_sizes)/sizeof(element_sizes[0]);
  thrust::device_vector<int> dv_es(element_sizes, element_sizes+ds);
  thrust::device_vector<int> dv_mo(ds);
  thrust::exclusive_scan(dv_es.begin(), dv_es.end(), dv_mo.begin());
  std::cout << "element_sizes:" << std::endl;
  thrust::copy_n(dv_es.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl << "memory_offsets:" << std::endl;
  thrust::copy_n(dv_mo.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl << "memory_size:" << std::endl << dv_es[ds-1] + dv_mo[ds-1] << std::endl;
}
$ nvcc -o t319 t319.cu
$ ./t319
element_sizes:
10,100,23,45,
memory_offsets:
0,10,110,133,
memory_size:
178
$
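
With an inclusive scan instead, the last element of the result is memory_size directly, so no extra arithmetic is needed. A minimal sketch of that variant (same data, scanning in place):

#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(){
  const int element_sizes[] = { 10, 100, 23, 45 };
  thrust::device_vector<int> dv(element_sizes, element_sizes+4);
  // in-place inclusive scan: dv becomes 10,110,133,178
  thrust::inclusive_scan(dv.begin(), dv.end(), dv.begin());
  std::cout << dv[3] << std::endl;  // prints 178, i.e. memory_size
}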
  • So I am working inside kernels. Can I call a thrust function within a kernel? – Chris Nov 07 '18 at 15:34
  • Nevermind, found my answer here: https://thrust.github.io/doc/group__prefixsums.html (looks like there is a host function overload) – Chris Nov 07 '18 at 15:35
  • Yes, thrust functions [can be called from device code](https://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix/28254765#28254765). However the [cub](https://nvlabs.github.io/cub/classcub_1_1_block_scan.html) library may give you more flexibility. It also has scan operations that can be done device-wide, block-wide, or at even smaller granularities, all callable from device code. cub also gives you more direct control over handling any temporary allocations needed by the algorithm. – Robert Crovella Nov 07 '18 at 15:37
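
To illustrate the block-wide cub approach mentioned in the last comment, here is a minimal sketch (the kernel name and the single-block launch are illustrative assumptions, not the asker's actual setup) that produces both memory_offsets and memory_size in one kernel using cub::BlockScan:

#include <cub/cub.cuh>
#include <cstdio>

const int BLOCK_THREADS = 128;

__global__ void offsets_kernel(const int *element_sizes, int *memory_offsets,
                               int *memory_size, int n)
{
  // specialize BlockScan for int data and this block size
  typedef cub::BlockScan<int, BLOCK_THREADS> BlockScan;
  __shared__ typename BlockScan::TempStorage temp_storage;

  int tid = threadIdx.x;
  int size = (tid < n) ? element_sizes[tid] : 0;  // pad inactive threads with 0
  int offset, total;

  // exclusive scan across the block; 'total' receives the block-wide sum
  BlockScan(temp_storage).ExclusiveSum(size, offset, total);

  if (tid < n) memory_offsets[tid] = offset;
  if (tid == 0) *memory_size = total;
}

int main(){
  const int n = 4;
  int h_sizes[n] = { 10, 100, 23, 45 };
  int *d_sizes, *d_offsets, *d_total;
  cudaMalloc(&d_sizes, n*sizeof(int));
  cudaMalloc(&d_offsets, n*sizeof(int));
  cudaMalloc(&d_total, sizeof(int));
  cudaMemcpy(d_sizes, h_sizes, n*sizeof(int), cudaMemcpyHostToDevice);
  offsets_kernel<<<1, BLOCK_THREADS>>>(d_sizes, d_offsets, d_total, n);
  int h_offsets[n], h_total;
  cudaMemcpy(h_offsets, d_offsets, n*sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) printf("%d,", h_offsets[i]); // 0,10,110,133,
  printf("\nmemory_size: %d\n", h_total);                  // 178
  return 0;
}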