
I have some calculations that I want to parallelize if my user has a CUDA-capable GPU; otherwise I want to execute the same code on the CPU. I don't want to maintain two versions of the algorithm, one for the CPU and one for the GPU. I'm considering the following approach, but am wondering if the extra level of indirection will hurt performance or if there is a better practice.

For my test, I took the basic CUDA template that adds the elements of two integer arrays and stores the result in a third array. I removed the actual addition operation and placed it in its own function marked with both the __device__ and __host__ qualifiers...

__device__ __host__ void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}

... then modified the kernel to call the aforementioned function on the element identified by threadIdx...

__global__ void addKernel(int* c, const int* a, const int* b)
{
    const unsigned i = threadIdx.x;
    addSingleItem(c + i, a + i, b + i);
}

So now my application can check for the presence of a CUDA device. If one is found I can use...

addKernel <<<1, size>>> (dev_c, dev_a, dev_b);

... and if not I can forego parallelization and iterate through the elements calling the host version of the function...

int* pA = (int*)a;
int* pB = (int*)b;
int* pC = (int*)c;

for (int i = 0; i < arraySize; i++)
{
    addSingleItem(pC++, pA++, pB++);
}
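For completeness, here is a minimal sketch of how I'm tying the two paths together (addItems and cudaDeviceAvailable are just illustrative names; cudaGetDeviceCount is what I'm using to detect a usable device):

#include <cuda_runtime.h>

// Returns true only if the runtime reports at least one usable CUDA device.
bool cudaDeviceAvailable()
{
    int deviceCount = 0;
    cudaError_t status = cudaGetDeviceCount(&deviceCount);
    return (status == cudaSuccess) && (deviceCount > 0);
}

void addItems(int* c, const int* a, const int* b, unsigned arraySize)
{
    if (cudaDeviceAvailable())
    {
        // allocate dev_a/dev_b/dev_c, copy the inputs, then:
        // addKernel <<<1, arraySize>>> (dev_c, dev_a, dev_b);
        // copy dev_c back into c and free the device buffers
    }
    else
    {
        for (unsigned i = 0; i < arraySize; i++)
        {
            addSingleItem(c + i, a + i, b + i);   // host version of the shared function
        }
    }
}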

Everything seems to work in my small test app, but I'm concerned about the extra call involved. Do device-to-device function calls incur any significant performance hit? Is there a more generally accepted way to do CPU fallback that I should adopt?

dazedandconfused

1 Answer


If addSingleItem and addKernel are defined in the same translation unit/module/file, there should be no cost to having a device-to-device function call. The compiler will aggressively inline that code, as if you wrote it in a single function.

That is undoubtedly the best approach if it can be managed, for the reason described above.

If it's desired to still have some file-level modularity, it is possible to break code into a separate file and include that file in the compilation of the kernel function. Conceptually this is no different than what is described already.
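For example (file names here are just illustrative), the shared operation can live in a small header that the kernel's .cu file includes, so everything still ends up in one translation unit:

// add_single.cuh -- shared element-wise operation, usable from host and device code
#pragma once

__host__ __device__ inline void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}

// kernel.cu -- including the header keeps the function in the same translation unit,
// so the compiler can still inline it into the kernel
#include "add_single.cuh"

__global__ void addKernel(int* c, const int* a, const int* b)
{
    const unsigned i = threadIdx.x;
    addSingleItem(c + i, a + i, b + i);
}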

Another possible approach is to use compiler macros to assist in the addition, removal, or modification of code to handle the GPU vs. non-GPU case. There are endless possibilities here, but see here for a simple idea: you can redefine what __host__ __device__ means in different scenarios, for example. I would say this probably only makes sense if you are building separate binaries for the GPU and non-GPU cases, but you may find a clever way to handle it in the same executable.
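As a rough sketch of that macro idea (the CUDA_CALLABLE name is arbitrary; __CUDACC__ is defined whenever nvcc is compiling the file):

// When nvcc compiles this file, mark the function for both host and device;
// when a plain host compiler compiles it, the qualifiers disappear.
#ifdef __CUDACC__
#define CUDA_CALLABLE __host__ __device__
#else
#define CUDA_CALLABLE
#endif

CUDA_CALLABLE void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}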

Finally, if you desire this but must place the __device__ function in a separate translation unit, it is still possible, but there may be some performance loss due to the device-to-device function call across module boundaries. The amount of performance loss is hard to generalize, since it depends heavily on code structure, but it's not unusual to see a 10% or 20% performance hit. In that case, you may wish to investigate the device link-time optimization that became available in CUDA 11.
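As a rough sketch of what that looks like on the nvcc command line (assuming CUDA 11.2 or newer; -dc requests relocatable device code for separate compilation and -dlto enables device link-time optimization; check nvcc --help for your toolkit version):

nvcc -dc -dlto add_single.cu -o add_single.o
nvcc -dc -dlto kernel.cu -o kernel.o
nvcc -dlto add_single.o kernel.o -o app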

This question may also be of interest, although only tangentially related here.

Robert Crovella