
I have several CUDA kernels which are basically doing the same thing with some variations. What I would like to do is reduce the amount of code needed. My first thought was to use macros, so my resulting kernels would look like this (simplified):

__global__ void kernelA( ... )
{
   INIT(); // macro to initialize variables

   // do specific stuff for kernelA
   b = a + c;

   END(); // macro to write back the result
}

__global__ void kernelB( ... )
{
   INIT(); // macro to initialize variables

   // do specific stuff for kernelB
   b = a - c;

   END(); // macro to write back the result
}
...

Since macros are nasty, ugly and evil I am looking for a better and cleaner way. Any suggestions?

(A switch statement would not do the job: In reality, the parts which are the same and the parts which are kernel specific are pretty interweaved. Several switch statements would be needed which would make the code pretty unreadable. Furthermore, function calls would not initialize the needed variables. )

(This question might be answerable for general C++ as well, just replace all 'CUDA kernel' with 'function' and remove '__global__' )

Dirk
    Classes? Functions? Struct programming? – Kerrek SB May 13 '13 at 08:26
  • We need some more details to help you. If switch had been an option, @talonmies' answer to [this question](http://stackoverflow.com/a/6179580/1043187) would have probably helped you. – BenC May 13 '13 at 08:40
  • OOP usually answers "Several switch statements [...] which would make the code pretty unreadable." with "create an object hierarchy and split switch branches into class specializations". Why would that not work? – utnapistim May 13 '13 at 08:56
  • Who told you that macros are nasty, ugly and evil? They are one of the best features of C. Be not afraid to use them. Well, eh, you have C++ templates as well. – isti_spl May 13 '13 at 13:46

2 Answers


Updated: I was told in the comments that classes and inheritance don't mix well with CUDA. Therefore only the first part of the answer applies to CUDA, while the rest answers the more general C++ part of your question.

For CUDA, you will have to use pure functions, "C-style":

struct KernelVars {
  int a;
  int b;
  int c;
};

__device__ void init(KernelVars& vars) {
  INIT(); //whatever the actual code is
}

__device__ void end(KernelVars& vars) {
  END(); //whatever the actual code is
}

__global__ void KernelA(...) {
  KernelVars vars;
  init(vars);
  vars.b = vars.a + vars.c;
  end(vars);
}

This is the answer for general C++, where you would use OOP techniques like constructors and destructors (they are perfectly suited for those init/end pairs), or the template method pattern, which can be used with other languages as well:

Using ctor/dtor and templates, "C++ Style":

class KernelBase {
protected:
  int a, b, c;

public:
  KernelBase() {
    INIT(); //replace by the contents of that macro
  }   
  ~KernelBase() {
    END();  //replace by the contents of that macro
  }
  virtual void run() = 0;
};

struct KernelAdd : KernelBase {
  void run() { b = a + c; }
};

struct KernelSub : KernelBase {
  void run() { b = a - c; }
};

template<class K>
void kernel(...)
{
  K k;
  k.run();
}

void kernelA( ... ) { kernel<KernelAdd>(); }

Using template method pattern, general "OOP style"

class KernelBase {
  virtual void do_run() = 0;
protected:
  int a, b, c;
public:
  void run() { //the template method
    INIT(); 

    do_run();

    END();
  }
};

struct KernelAdd : KernelBase {
  void do_run() { b = a + c; }
};

struct KernelSub : KernelBase {
  void do_run() { b = a - c; }
};

void kernelA(...)
{
  KernelAdd k;
  k.run();
}
Arne Mertz
  • Sadly no classes, no ctors, no dtors and no inheritance. But your pure C style is almost correct (only init and end must be `__device__` functions). However +1! – Michael Haidl May 13 '13 at 10:04
  • @kronos like I said, I don't know CUDA, so I did not know it's restrictions. You can see the solutions that involve classes as answer to the "general C++" part of the question. – Arne Mertz May 13 '13 at 10:53
  • Thanks for your answer, putting all kernel variables in a struct and using __device__ functions will do the trick, I think. – Dirk May 13 '13 at 13:30
  • Sorry, my fault. Seems that I'm a little stuck on this topic on compute capability prior to 2.0. I must correct myself: there are classes, ctors and inheritance, with minor or major limitations. – Michael Haidl May 13 '13 at 14:00

You can use `__device__` functions as an alternative to "INIT()" and "END()":

__device__ int init()
{
    return threadIdx.x + blockIdx.x * blockDim.x;
}

Another alternative is to use function templates:

#define ADD 1
#define SUB 2

template <int op> __global__ void calculate(float* a, float* b, float* c)
{
    // init code ...
    switch (op)
    {
    case ADD:
        c[id] = a[id] + b[id];
        break;
    case SUB:
        c[id] = a[id] - b[id];
        break;
    }
    // end code ...
}

and invoke them using:

calculate<ADD><<<...>>>(a, b, c);

The CUDA compiler does the work: since the template parameter is a compile-time constant, it builds a separate function for each instantiation and removes the dead switch branches as a performance optimization.

Michael Haidl