Using function-templated code across the g++-nvcc boundary (including kernels)

Question

Suppose I compile the following with NVIDIA CUDA's nvcc compiler:

template<typename T, typename Operator>
__global__ void fooKernel(T t1, T t2)  {
    Operator op;
    doSomethingWith(t1, t2);
}

template<typename T>
__device__ __host__ T bar(T t1, T t2)  {
    return t1 + t2;
}

template<typename T, typename Operator>
void foo(T t1, T t2)  {
    fooKernel<T, Operator><<<2, 2>>>(t1, t2);
}

// explicit instantiation
template decltype(foo<int, bar<int>>) foo<int, bar<int>>;

Now, I want my gcc, non-nvcc code to call foo():

...

template<typename T, typename Operator> void foo(T t1, T t2);


foo<int, bar<int>> (123, 456);
...

I have the appropriate (?) instantiation in the .o/.a/.so file I compile with CUDA.

Can I make that happen?

Why is `bar` defined twice in the first snippet? And what is the real question here? Why would you want to apply the `decltype` keyword (if it is supported in CUDA, which I'm not sure about) to a `void` function that doesn't return a value? What are you expecting that will achieve? — talonmies, Feb 09 '15 at 15:38
@talonmies: 1. mis-copy-paste. 2. See [here](http://stackoverflow.com/questions/28354752/template-vs-template-without-brackets-whats-the-difference); that will achieve an explicit instantiation. Will clear it up. — einpoklum, Feb 09 '15 at 18:43

score 2 · Accepted Answer · edited Feb 09 '15 at 20:51

The problem here is that templated code is typically instantiated at the place of usage, which doesn't work because foo contains a kernel call which cannot be parsed by g++. Your approach of explicitly instantiating the template and forward declaring it for the host compiler is the right one. Here's how to do this. I slightly fixed up your code and split it into 3 files:

gpu.cu
gpu.cuh
cpu.cpp

gpu.cuh

This file contains the templated code for use by gpu.cu. I gave your foo() function something real to do (it adds its two arguments on the device and copies the sum back) so it is easy to check that everything works.

#pragma once
#include <cuda_runtime.h>

template <typename T>
struct bar {
    __device__ __host__ T operator()(T t1, T t2)
    {
        return t1 + t2;
    }
};

template <template <typename> class Operator, typename T>
__global__ void fooKernel(T t1, T t2, T* t3)
{
    Operator<T> op;
    *t3 = op(t1, t2);
}

template <template <typename> class Operator, typename T>
T foo(T t1, T t2)
{
    T* t3_d;
    T t3_h;
    cudaMalloc(&t3_d, sizeof(*t3_d));
    fooKernel<Operator><<<1, 1>>>(t1, t2, t3_d);
    cudaMemcpy(&t3_h, t3_d, sizeof(*t3_d), cudaMemcpyDeviceToHost);
    cudaFree(t3_d);
    return t3_h;
}

gpu.cu

This file only instantiates the foo() function to make sure it will be available for linking:

#include "gpu.cuh"

template int foo<bar>(int, int);
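If cpu.cpp needs more combinations, gpu.cu has to list each one explicitly: a call for which gpu.o contains no instantiation still compiles on the g++ side, but fails at link time with an undefined reference. The sketch below is a host-only stand-in, assuming foo() does the work on the CPU instead of launching fooKernel so that it builds with plain g++; the `mul` operator is hypothetical and only there to show a second operator.

```cpp
// Host-only stand-in for gpu.cuh: same signatures as the answer's code,
// but foo() computes on the CPU instead of launching fooKernel.
template <typename T>
struct bar {
    T operator()(T t1, T t2) const { return t1 + t2; }
};

// Hypothetical second operator, only to show that every
// (Operator, T) combination used elsewhere needs its own instantiation.
template <typename T>
struct mul {
    T operator()(T t1, T t2) const { return t1 * t2; }
};

template <template <typename> class Operator, typename T>
T foo(T t1, T t2)
{
    Operator<T> op;    // in gpu.cuh this work happens inside fooKernel
    return op(t1, t2);
}

// The gpu.cu part: one explicit instantiation per combination the
// g++-compiled code is going to call.
template int    foo<bar>(int, int);
template double foo<bar>(double, double);
template int    foo<mul>(int, int);
```

In the real gpu.cu only the `template ... foo<...>(...);` lines would be added; the definitions stay in gpu.cuh, and cpu.cpp would also need a forward declaration of `mul`.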

cpu.cpp

In this plain C++ source file, we must make sure g++ never sees the template definitions: instantiating foo() here would require compiling the <<<...>>> kernel launch, which g++ rejects with a compile error. Instead, we only forward declare the struct bar and the function foo. The code looks like this:

#include <cstdio>

template <template <typename> class Operator, typename T>
T foo(T t1, T t2);

template <typename T>
struct bar;

int main()
{
    printf("%d \n", foo<bar>(3, 4));
}
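To see why cpu.cpp gets away with declarations only, the following sketch folds both sides into one file that plain g++ can compile. It rests on assumptions: foo() does the addition on the CPU instead of launching fooKernel, and call_from_cpp_side() is a hypothetical helper playing the role of main(). At the call, bar is still incomplete and foo has no body; the compiler only needs the declaration to emit a call to foo<bar, int>. In the real build the body lives in a different translation unit, and the call links only because gpu.o contains the explicit instantiation.

```cpp
// ---- What cpu.cpp sees: declarations only, no definitions ----
template <template <typename> class Operator, typename T>
T foo(T t1, T t2);

template <typename T>
struct bar;  // still incomplete at the call below; naming foo<bar> is enough

// Hypothetical helper standing in for cpu.cpp's main().
int call_from_cpp_side()
{
    return foo<bar>(3, 4);  // T deduced as int; only the declaration is used
}

// ---- What gpu.cuh + gpu.cu provide (host-only stand-in, no kernel) ----
template <typename T>
struct bar {
    T operator()(T t1, T t2) const { return t1 + t2; }
};

template <template <typename> class Operator, typename T>
T foo(T t1, T t2)
{
    Operator<T> op;  // in the real gpu.cuh this runs inside fooKernel
    return op(t1, t2);
}

// Explicit instantiation definition, as in gpu.cu.
template int foo<bar>(int, int);
```

Because the symbol name is derived from the declaration, the declarations in cpu.cpp should match those in gpu.cuh exactly (same template parameter list and return type); a mismatch typically shows up as an undefined reference at link time.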

Makefile

This will put the code all together into an executable:

.PHONY: clean all
all: main

clean:
	rm -f *.o main

main: gpu.o cpu.o
	g++ -L/usr/local/cuda/lib64 $^ -lcudart -o $@

gpu.o: gpu.cu
	nvcc -c -arch=sm_20 $< -o $@

cpu.o: cpu.cpp
	g++ -c $< -o $@

gpu.cu (the kernel plus the host function that launches it) is compiled by nvcc, cpu.cpp by g++, and g++ links everything together against the CUDA runtime library. Upon running ./main you see the beautiful result:

7

The key thing to remember here is that kernel launches and kernel definitions have to be in the .cu files that are compiled by nvcc. For future reference, I will also leave this link here, on separation of linking and compilation with CUDA.

Can you explain the need/the importance of the template-within-a-template you used? — einpoklum, Feb 09 '15 at 20:55
It is not important, it just simplifies writing. This way you don't have to pass a specific instantiation of the template to the function, `foo<bar<int> >` (which needs the space between the angle brackets before C++11), but can simply use `foo<bar>`. The whole thing would work with normal template parameters too. — jepio, Feb 09 '15 at 20:58
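A small comparison of the two spellings, as a sketch: foo_plain, foo_tt, sum_plain and sum_tt are hypothetical names used only for this illustration, and bar is the answer's functor without the CUDA qualifiers so it builds with plain g++.

```cpp
template <typename T>
struct bar {
    T operator()(T t1, T t2) const { return t1 + t2; }
};

// Ordinary type parameter: the caller must name the instantiated
// operator type, e.g. bar<int>.
template <typename Operator, typename T>
T foo_plain(T t1, T t2)
{
    Operator op;
    return op(t1, t2);
}

// Template template parameter: the caller names only the template bar,
// and foo_tt instantiates it with the deduced element type T.
template <template <typename> class Operator, typename T>
T foo_tt(T t1, T t2)
{
    Operator<T> op;
    return op(t1, t2);
}

int sum_plain() { return foo_plain<bar<int> >(3, 4); }  // space needed before C++11
int sum_tt()    { return foo_tt<bar>(3, 4); }
```

Both calls do the same thing; the template template version just avoids repeating the element type, which also means one explicit instantiation line per type in gpu.cu reads as `foo<bar>(int, int)` rather than `foo<bar<int> >(int, int)`.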
Could I trouble you to link to somewhere explaining this point in more detail? — einpoklum, Feb 09 '15 at 22:20
I would have to redirect you to [this so question](http://stackoverflow.com/questions/213761/what-are-some-uses-of-template-template-parameters-in-c) or the *template template parameters* section of [cppreference](http://en.cppreference.com/w/cpp/language/template_parameters). — jepio, Feb 09 '15 at 22:27
