Smart design for large kernel with different inputs that only changes one line of code

Question

I am designing some kernels that I would like to be able to call in two ways: once with a standard float * device pointer as input (for writing), and once with a cudaSurfaceObject_t as input (for writing). The kernel itself is long (>200 lines) and ultimately only the last line needs to differ. In one case it is a standard out[idx]=val assignment, while in the other it is a surf3Dwrite() call. The rest of the kernel is identical.

Something like

__global__ void kernel(float * out , ....)
{

// 200 lines of math

// only difference, aside from input argument
idx=....
out[idx]=val;
}

vs

__global__ void kernel(cudaSurfaceObject_t out, ...)
{

// 200 lines of math

// only difference, aside from input argument
  // surf3Dwrite takes the value first, then the surface object; x is a byte offset
  surf3Dwrite(val,out,x*sizeof(float),y,z);
}

What is a smart way of coding this without copy-pasting the entire kernel and renaming it? I looked at templates, but (if I am not wrong) they are for types only: one cannot simply have a completely different line of code when the type is different in a template. CUDA kernels don't seem to be able to be overloaded either.

What do you mean by "as output"? `__global__` functions can't return anything, so what is an "output" in this case? Will different versions of the code have different arguments? Is [this](https://stackoverflow.com/a/6179580/681865) of any help? — talonmies, Jan 23 '20 at 18:34
@talonmies Yes indeed. I realize it was unclear. Hopefully the edit clarifies. I assume that template option is what I want indeed. Not super elegant, but may do the job. Does it allow different arguments for functions though? — Ander Biguri, Jan 23 '20 at 18:39
The edit helps clarify things a bit. This sounds like a classic example of where a bit of abstraction would solve your problem. Hide the different code in a small specialized function, keep the common code common, template the argument type. Shouldn't have to be more complex than that — talonmies, Jan 23 '20 at 18:48
Yes, I was thinking the same. I am not very experienced with all the C++ templating: I can't figure out how putting the different line into a function would change anything though, I still need to call this function differently for each template option right? — Ander Biguri, Jan 23 '20 at 18:58
@talonmies I do not know how to hide the different code in a function (as I seem to just be passing the issue to a smaller function), but I guess I could hide all the *similar* code in a function and write two different higher level functions... Again, doesn't feel like the best decision, but would do the job I guess. — Ander Biguri, Jan 23 '20 at 19:27

Robert Crovella · Accepted Answer · 2020-01-24T02:54:01.583

> CUDA kernels don't seem to be able to be overloaded either.

It should be possible to overload kernels. Here is one possible approach, using overloading (and no templating):

$ cat t1648.cu
// Includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>

__device__ float my_common(float *d, int width, unsigned int x, unsigned int y){

// 200 lines of common code...
  return d[y *width +x];
}




////////////////////////////////////////////////////////////////////////////////
// Kernels
////////////////////////////////////////////////////////////////////////////////
//! Write to a cuArray using surface writes
//! @param gIData input data in global memory
////////////////////////////////////////////////////////////////////////////////
__global__ void WriteKernel(float *gIData, int width, int height,
                                       cudaSurfaceObject_t outputSurface)
{
    // calculate surface coordinates
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    // read from global memory and write to cuarray (via surface object)
    surf2Dwrite(my_common(gIData, width, x, y),
                outputSurface, x*4, y, cudaBoundaryModeTrap);
}

__global__ void WriteKernel(float *gIData, int width, int height,
                                       float *out)
{
    // calculate coordinates
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    // read from global memory and write to global memory
    out[y*width+x] = my_common(gIData, width, x, y);
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    printf("starting...\n");


    unsigned width = 256;
    unsigned height = 256;
    unsigned int size = width * height * sizeof(float);

    // Allocate device memory for result
    float *dData = NULL;
    checkCudaErrors(cudaMalloc((void **) &dData, size));

    // Allocate array and copy image data
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray *cuArray;
    float *out = NULL;
    checkCudaErrors(cudaMalloc((void **) &out, size));
    checkCudaErrors(cudaMallocArray(&cuArray,
                                    &channelDesc,
                                    width,
                                    height,
                                    cudaArraySurfaceLoadStore));

    dim3 dimBlock(8, 8, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    cudaSurfaceObject_t outputSurface;
    cudaResourceDesc    surfRes;
    memset(&surfRes, 0, sizeof(cudaResourceDesc));
    surfRes.resType = cudaResourceTypeArray;
    surfRes.res.array.array = cuArray;

    checkCudaErrors(cudaCreateSurfaceObject(&outputSurface, &surfRes));
    WriteKernel<<<dimGrid, dimBlock>>>(dData, width, height, outputSurface);
    WriteKernel<<<dimGrid, dimBlock>>>(dData, width, height, out);

    checkCudaErrors(cudaDestroySurfaceObject(outputSurface));
    checkCudaErrors(cudaFree(dData));
    checkCudaErrors(cudaFree(out));
    checkCudaErrors(cudaFreeArray(cuArray));
}
$ nvcc -I/usr/local/cuda/samples/common/inc t1648.cu -o t1648
$

The above example was hacked together rapidly from the simpleSurfaceWrite CUDA sample code. It is not intended to be functional or run "correctly". It is designed to show how overloading can be used from a code structure standpoint to address the stated objective.
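
The comments on the question also suggested templating the argument type and hiding the differing line in a small function. The following is a minimal sketch of that combination, not part of the original answer. The names `common_math`, `store`, `WriteKernelT` and `overloads_agree` are hypothetical, the "math" is a placeholder, and a 2D surface is used to stay close to the sample above (a `surf3Dwrite` version would presumably follow the same pattern). Most error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Placeholder for the ~200 lines of shared math (hypothetical).
__device__ float common_math(const float *in, int width, unsigned x, unsigned y)
{
    return 2.0f * in[y * width + x];
}

// Overloaded store helpers: the only code that differs per output type.
__device__ void store(float *out, int width, unsigned x, unsigned y, float val)
{
    out[y * width + x] = val;
}

__device__ void store(cudaSurfaceObject_t out, int /*width*/, unsigned x, unsigned y, float val)
{
    // Surface writes take the x coordinate as a byte offset.
    surf2Dwrite(val, out, (int)(x * sizeof(float)), (int)y);
}

// One kernel body; Out is deduced as float* or cudaSurfaceObject_t at the launch site.
template <typename Out>
__global__ void WriteKernelT(const float *in, int width, int height, Out out)
{
    unsigned x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= (unsigned)width || y >= (unsigned)height) return;
    store(out, width, x, y, common_math(in, width, x, y));
}

// Host helper: runs both instantiations and checks that they wrote the same values.
bool overloads_agree(int width, int height)
{
    size_t n = (size_t)width * height, bytes = n * sizeof(float);
    std::vector<float> hIn(n), hLin(n), hSurf(n);
    for (size_t i = 0; i < n; ++i) hIn[i] = (float)i;

    float *dIn = NULL, *dOut = NULL;
    cudaMalloc((void **)&dIn, bytes);
    cudaMalloc((void **)&dOut, bytes);
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &res);

    dim3 block(8, 8, 1);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y, 1);
    WriteKernelT<<<grid, block>>>(dIn, width, height, dOut);  // linear-memory version
    WriteKernelT<<<grid, block>>>(dIn, width, height, surf);  // surface version
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(hLin.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy2DFromArray(hSurf.data(), width * sizeof(float), arr, 0, 0,
                          width * sizeof(float), height, cudaMemcpyDeviceToHost);
    for (size_t i = 0; ok && i < n; ++i)
        ok = (hLin[i] == hSurf[i]) && (hLin[i] == 2.0f * hIn[i]);

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(arr);
    cudaFree(dIn);
    cudaFree(dOut);
    return ok;
}

int main()
{
    printf("%s\n", overloads_agree(256, 256) ? "outputs match" : "mismatch or CUDA error");
    return 0;
}
```

Compared with the two overloaded kernels above, this keeps a single kernel definition and moves the overloading down to the small `store` device function. Which variant is preferable is largely a matter of taste: the overloaded kernels avoid templates entirely, while the templated kernel avoids repeating the index calculation.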

Thanks Robert. I must have had some small mistake in my previous tests, as my compiler failed at overloading because the function was already defined. Should have checked better. This is definitely the right way to go. — Ander Biguri, Jan 24 '20 at 12:33
