
I am porting some physics simulation code from C++ to CUDA.

The fundamental algorithm can be understood as applying an operator to each element of a vector. In pseudocode, a simulation might include a kernel like:

apply(Operator o, Vector v){
    ...
}

For instance:

apply(add_three_operator, some_vector)

would add three to each element in the vector.
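To make that concrete, here is a minimal serial C++ sketch of such an `apply` (names like `AddThree` are hypothetical, purely for illustration); the CUDA port would replace the loop with a kernel launch:

```cpp
#include <vector>

// Serial stand-in for the kernel: calls op on every element in place.
template <class Op>
void apply(Op op, std::vector<double>& v) {
    for (double& x : v)
        x = op(x);
}

// Hypothetical "add three" operator as a plain functor.
struct AddThree {
    double operator()(double x) const { return x + 3.0; }
};

// apply(AddThree{}, some_vector);  // adds three to each element
```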

In my C++ code, I have an abstract base class Operator with many different concrete implementations. The important interface is:

class Operator {
    virtual double operate(double x) = 0;
    Operator compose(Operator lo, Operator ro);
    ...
};

The implementation for AddOperator might look like this:

class AddOperator : public Operator{
    private:
        double to_add;
    public:
        AddOperator(double to_add): to_add(to_add){}
        double operate(double x) override {
            return x + to_add;
        }
};

The Operator class has methods for scaling and composing concrete implementations of Operator. This abstraction lets me compose "leaf" operators into more general transformations.

For instance:

apply(compose(add_three_operator, square_operator), some_vector);

would add three then square each element of the vector.

The problem is that CUDA doesn't support virtual method calls in a kernel on objects constructed on the host (the vtable isn't valid on the device). My current thought is to use templates instead. Kernel calls would then look something like:

apply<Composition<AddOperator,SquareOperator>>
    (compose(add_three_operator, square_operator), some_vector);

Any suggestions?

  • I believe `virtual` functions require compiling with `-arch=sm_20` or higher. However, I'd recommend isolating your polymorphism to the host code which launches the kernel. Even if you eventually got things compiling I'd expect the performance of virtual function dispatch in SIMD code will be disappointing. – Jared Hoberock Jul 24 '13 at 02:26
  • I agree with Jared. Even on the CPU, if the same operations are being applied to every element of large vectors, I would consider refactoring so that the polymorphism is at a higher level, and the virtual method calls are not in your inner loops. Once you do that, parallelizing will be much more performant (in CUDA, OpenMP, or whatever). You might also consider Thrust for this. – harrism Jul 24 '13 at 02:40
  • Thanks for the feedback. I actually already am using Thrust. I'm going ahead with templates. – user2611717 Jul 24 '13 at 07:07
  • So is there a question we can actually answer? – harrism Jul 25 '13 at 04:11

1 Answer


Something like this perhaps...

template <class Op1, class Op2>
class Composition {...};

template <class Op1, class Op2>
Composition<Op1, Op2> compose(Op1& op1, Op2& op2) {...}

template<class C>
void apply(C& c, VecType& vec){...}
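One way the elided bodies might be filled in, shown here as a host-side sketch (the operator names mirror the question's; in CUDA, the call operators would additionally be marked `__host__ __device__`, and `apply` would launch a kernel instead of looping):

```cpp
#include <vector>

using VecType = std::vector<double>;

// Hypothetical leaf operators; in CUDA their operator() would be
// marked __host__ __device__ so device code can call them.
struct AddOperator {
    double to_add;
    double operator()(double x) const { return x + to_add; }
};
struct SquareOperator {
    double operator()(double x) const { return x * x; }
};

// Applies op1 first, then op2: Composition{a, s}(x) == s(a(x)).
template <class Op1, class Op2>
struct Composition {
    Op1 op1;
    Op2 op2;
    double operator()(double x) const { return op2(op1(x)); }
};

template <class Op1, class Op2>
Composition<Op1, Op2> compose(Op1 op1, Op2 op2) {
    return {op1, op2};
}

// Serial stand-in; the CUDA version would evaluate c(vec[i]) per thread,
// with the composed call resolved entirely at compile time.
template <class C>
void apply(C c, VecType& vec) {
    for (double& x : vec)
        x = c(x);
}
```

With that, `apply(compose(AddOperator{3.0}, SquareOperator{}), some_vector);` adds three and then squares each element, with no virtual dispatch anywhere in the inner loop.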