
I'm trying to parallelize with CUDA a function that is called many times. Each call works with the same matrix, so I want to keep that matrix in GPU memory; on each call I upload a vector to the GPU, multiply it by the matrix, and return the result. I prefer a C++ template style, so Thrust has higher priority.

Please recommend some functions for this and, if possible, a few small illustrating samples. I'm not providing my code, not because it's a secret, but because of its complexity and size.

sbeliakov
  • So you want to compute the dot product of a matrix with a vector? How large are they? And when you say "many times", do you want to compute all these products simultaneously on the GPU, or is this part of some iterative scheme like a solver, where the vector changes from iteration to iteration? – talonmies Mar 18 '13 at 15:19
  • The matrix is about 3000x100, and the vector, which differs from call to call, has about 100 elements. A dot product is not exactly what the function does, but we can assume it is, because the real computation has the same complexity. I don't want to compute those "many products" simultaneously; by "many times" I'm stressing that we should store this large matrix on the GPU. In fact we do the same operation with the given vector and each of the 3000 rows of the matrix, and I want to parallelize those 3000 computations. – sbeliakov Mar 18 '13 at 15:40
  • I want to use something like thrust::transform, which would take each of the 3000 row vectors and map it to a number by a rule defined by the given vector, which changes from call to call. – sbeliakov Mar 18 '13 at 15:43

1 Answer


For thrust, device_vector, device_ptr, etc. are what you are looking for.

From thrust::device_vector to raw pointer and back?
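For example, a minimal sketch of going back and forth between a device_vector and a raw pointer (the variable names here are just illustrative):

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>

void example()
{
    // storage allocated in GPU global memory
    thrust::device_vector<float> d_vec(100);

    // device_vector -> raw pointer (e.g. to embed in a functor)
    float* raw = thrust::raw_pointer_cast(d_vec.data());

    // raw pointer -> device_ptr, so thrust algorithms treat it as device data
    thrust::device_ptr<float> dev = thrust::device_pointer_cast(raw);
}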

But in order to program the GPU efficiently, I suggest also becoming familiar with the CUDA memory types:

http://www.cvg.ethz.ch/teaching/2011spring/gpgpu/cuda_memory.pdf (pdf warning)

The type of memory you are looking for is "global memory". Remember that all of this memory lives on the GPU card, not in host (CPU) memory, so it is only accessible from kernels and device functions.
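To sketch the setup from the question, a hedged example (GpuMatrix and apply are made-up names, not a thrust API): upload the matrix once into a device_vector, which allocates global memory, and copy only the small vector on each call:

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <vector>

// The matrix stays in GPU global memory as long as this object lives,
// so the big host->device copy happens only once.
struct GpuMatrix
{
    int rows, cols;
    thrust::device_vector<float> data;   // row-major, rows * cols elements

    GpuMatrix(const std::vector<float>& h, int r, int c)
        : rows(r), cols(c), data(h) {}
};

std::vector<float> apply(const GpuMatrix& m, const std::vector<float>& v)
{
    thrust::device_vector<float> d_v(v);        // upload the ~100-element vector
    thrust::device_vector<float> d_out(m.rows); // one result per row

    // ... run thrust::transform / a kernel over the rows here ...

    std::vector<float> out(m.rows);
    thrust::copy(d_out.begin(), d_out.end(), out.begin()); // download results
    return out;
}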

Any functor applied to device pointers just needs to be compiled with the device tag (example: a unary op):

template <typename T>
struct square
{
    // __host__ __device__ lets the functor run on both the CPU and the GPU
    __host__ __device__
    T operator()(const T& x) const {
        return x * x;
    }
};
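Using such a functor is a single call, e.g. thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), square<float>()); runs it element-wise on the GPU. For the 3000x100 case from the question, one hedged way (row_dot and multiply are my names, not a thrust API) is a functor holding raw device pointers into global memory that dots one matrix row per thread:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

// Dots one matrix row (row-major, in global memory) with the current vector.
// __device__ only: the raw pointers are not valid on the host.
struct row_dot
{
    const float* mat;   // rows * cols elements
    const float* vec;   // cols elements
    int cols;

    __device__
    float operator()(int row) const {
        float sum = 0.0f;
        for (int j = 0; j < cols; ++j)
            sum += mat[row * cols + j] * vec[j];
        return sum;
    }
};

void multiply(const thrust::device_vector<float>& d_mat, // uploaded once
              const thrust::device_vector<float>& d_vec, // uploaded per call
              thrust::device_vector<float>& d_out,       // one value per row
              int rows, int cols)
{
    row_dot f = { thrust::raw_pointer_cast(d_mat.data()),
                  thrust::raw_pointer_cast(d_vec.data()),
                  cols };

    // counting_iterator supplies row indices 0..rows-1; thrust runs one
    // functor call per row in parallel on the device.
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(rows),
                      d_out.begin(),
                      f);
}

Each of the 3000 rows then becomes an independent work item, which is the parallelism asked about in the comments.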
IdeaHat
  • And is it possible to access global memory from a functor that is then passed as an argument to thrust::transform? – sbeliakov Mar 18 '13 at 15:09
  • Provided all the iterators you pass to the function are device pointers and the functor is tagged with __device__, yes. I'll add an example to the answer... if you haven't been doing this, thrust has probably actually been running on the CPU, so you'll likely see a performance boost just by switching to device pointers. – IdeaHat Mar 18 '13 at 15:24
  • @maggot092 It sounds rather like you're really looking for some introductory material on CUDA and GPU programming in the first place. Of course *thrust* can do a lot for you, but if you want to implement your own algorithms beyond the standard ones, you won't get around understanding the basic principles of GPU programming, which are best gathered from a good book or at least a tutorial. – Christian Rau Mar 18 '13 at 15:36