4

Summary

Array [A - B - - - C] in device memory but want [A B C] - what's the quickest way with CUDA C?

Context

I have an array A of integers on device (GPU) memory. At each iteration, I randomly choose a few elements that are larger than 0 and subtract 1 from them. I maintain a sorted lookup array L of those elements that are equal to 0:

Array A:
       @ iteration i: [0 1 0 3 3 2 0 1 2 3]
   @ iteration i + 1: [0 0 0 3 2 2 0 1 2 3]

Lookup for 0-elements L:
       @ iteration i: [0 - 2 - - - 6 - - -]  ->  want compacted form: [0 2 6]
   @ iteration i + 1: [0 1 2 - - - 6 - - -]  ->  want compacted form: [0 1 2 6]

(Here, I randomly chose elements 1 and 4 to subtract 1 from. In my implementation in CUDA C, each thread maps onto an element in A, and so the lookup array is sparse to prevent data races and to maintain a sorted ordering (e.g. [0 1 2 6] rather than [0 2 6 1]).)

Later, I will do some operation only for those elements that are equal to 0. Hence I need to compact my sparse lookup array L, so that I can map threads to 0-elements.

As such, what is the most efficient way to compact a sparse array on device memory with CUDA C?

Many thanks.

mchen
  • 9,808
  • 17
  • 72
  • 125
  • 2
    You might consider using [thrust stream compaction](http://docs.thrust.googlecode.com/hg/group__stream__compaction.html). – Robert Crovella Jan 10 '13 at 12:45
  • Thanks - does thrust come with the standard CUDA installation? As I'm not the system administrator, how can I check on a Unix machine if the library is available? Thanks. – mchen Jan 10 '13 at 12:56
  • Yes, it does, assuming a recent version of CUDA. If you have a directory like `/usr/local/cuda/include/thrust` then you have thrust. Thrust is entirely templated/included code, so there are no ordinary libraries to worry about. You might be interested in the [quick start guide](https://github.com/thrust/thrust/wiki/Quick-Start-Guide). – Robert Crovella Jan 10 '13 at 13:07
  • Thanks @RobertCrovella, but I can't see any example usage for C users - only C++ which I'm not familiar with. For instance, how would you even call `thrust::copy_if()` on an array in device memory in CUDA C? – mchen Jan 10 '13 at 18:47
  • [cuSPARSE](http://docs.nvidia.com/cuda/cusparse/index.html#topic_11_10) library provide `cusparseSdense2csr()` to convert matrix from dense to sparse format. It should be very efficient, but maybe less efficient than `thrust::copy_if` – kangshiyin Jan 10 '13 at 18:49
  • Thanks @EricShiyinKang, but it would be most helpful if you could give an actual example of how to use either `cusparseSdense2csr()` or `thrust::copy_if()`? Say, I have `[1 2 0 0 5]` in device memory and I want `[1 2 5]`. Thanks – mchen Jan 10 '13 at 18:59
  • Yes, thrust is a c++ template library. That doesn't prevent it's use according to your question. I'll give an example as an answer. – Robert Crovella Jan 10 '13 at 19:12

1 Answers1

3

Suppose I have:

int V[] = {1, 2, 0, 0, 5};

And my desired result is:

int R[] = {1, 2, 5}

In effect we are removing elements that are zero, or copying elements only if non-zero.

#include <thrust/device_ptr.h>
#include <thrust/copy.h>
#include <stdio.h>
#define SIZE 5

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

  struct is_not_zero
  {
    __host__ __device__
    bool operator()(const int x)
    {
      return (x != 0);
    }
  };



int main(){

  int V[] = {1, 2, 0, 0, 5};
  int R[] = {0, 0, 0, 0, 0};
  int *d_V, *d_R;

  cudaMalloc((void **)&d_V, SIZE*sizeof(int));
  cudaCheckErrors("cudaMalloc1 fail");
  cudaMalloc((void **)&d_R, SIZE*sizeof(int));
  cudaCheckErrors("cudaMalloc2 fail");

  cudaMemcpy(d_V, V, SIZE*sizeof(int), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy1 fail");

  thrust::device_ptr<int> dp_V(d_V);
  thrust::device_ptr<int> dp_R(d_R);
  thrust::copy_if(dp_V, dp_V + SIZE, dp_R, is_not_zero());

  cudaMemcpy(R, d_R, SIZE*sizeof(int), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpy2 fail");

  for (int i = 0; i<3; i++)
    printf("R[%d]: %d\n", i, R[i]);

  return 0;


}

the struct defintion provides us with a functor that tests for zero elements. Note that in thrust, there are no kernels and we are not writing device code directly. All that happens behind the scenes. And I'd definitely suggest familiarizing yourself with the quick start guide, so as not to turn this question into a tutorial on thrust.

After reviewing the comments, I think this modified version of the code will work around the cuda 4.0 issues:

#include <thrust/device_ptr.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <stdio.h>
#define SIZE 5

  struct is_not_zero
  {
    __host__ __device__
    bool operator()(const int x)
    {
      return (x != 0);
    }
  };



int main(){

  int V[] = {1, 2, 0, 0, 5};
  int R[] = {0, 0, 0, 0, 0};

  thrust::host_vector<int> h_V(V, V+SIZE);
  thrust::device_vector<int> d_V = h_V;
  thrust::device_vector<int> d_R(SIZE, 0);

  thrust::copy_if(d_V.begin(), d_V.end(), d_R.begin(), is_not_zero());
  thrust::host_vector<int> h_R = d_R;

  thrust::copy(h_R.begin(), h_R.end(), R);

  for (int i = 0; i<3; i++)
    printf("R[%d]: %d\n", i, R[i]);

  return 0;


}
Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
  • Thanks. I tried your solution, but get compile error: `[...]/cuda/4.0.17/cuda/bin/../include/thrust/detail/device/cuda/copy_i‌​f.inl(71): error: more than one instance of overloaded function "min" matches the argument list: function "min(int, int)" function "min(unsigned int, unsigned int)" [...] argument types are: (long, const long) detected during: instantiation of "void thrust::detail::device::cuda::reduce_intervals(InputIterator, IndexType, IndexType, OutputIterator, BinaryFunction) [...]` where, `[...]` are truncations. – mchen Jan 10 '13 at 22:28
  • You took the code I posted, exactly, and tried to compile it? Or did you make any changes or additions? It looks like you're using CUDA 4.0. I have tested it on cuda 4.2 and cuda 5.0, but not 4.0 – Robert Crovella Jan 11 '13 at 00:43
  • Thanks Robert, I compiled the code exactly as you posted. Any ideas why CUDA 4.0's complaining? – mchen Jan 11 '13 at 02:26
  • cuda 4 is pretty old. Over 2 years old now. Try adding -m32 to your nvcc compile command line. – Robert Crovella Jan 11 '13 at 03:22
  • Cheers. Now I get the error: `In file included from /usr/include/features.h:371, from [...]/cuda/4.0.17/cuda/bin/../include/host_config.h:114, from [...]/cuda/4.0.17/cuda/bin/../include/cuda_runtime.h:59, from :0: /usr/include/gnu/stubs.h:7:27: error: gnu/stubs-32.h: No such file or directory`. Thanks for your patience. – mchen Jan 11 '13 at 03:26
  • This is [now a linux issue](http://stackoverflow.com/questions/7412548/gnu-stubs-32-h-no-such-file-or-directory). Please give a complete description of your system, including GPU, machine type (is it a laptop, desktop, server, etc?) Linux distro, and whether or not you have super user privileges. If you have super user privileges, it's fixable, but the more sensible route is to upgrade your cuda install. If you don't have superuser privileges, I think you're stuck. At least I don't have any solution for it. – Robert Crovella Jan 11 '13 at 03:53
  • I've added a modified version of my code that I think will work around the cuda 4 issues. You don't need to compile this with -m32 – Robert Crovella Jan 11 '13 at 15:17
  • Yes, it compiles now. Thanks. – mchen Jan 11 '13 at 21:04
  • What if I wanted to copy only the part that is not 0? – Sidney May 02 '16 at 20:19
  • I don't understand the question. The code I provided does exactly that - it copies only the elements that are not 0. Maybe you should ask a new question, rather than trying to sort it out in the comments of an old one. You can link back to this question for context if it helps. – Robert Crovella May 02 '16 at 20:29