I think template metaprogramming is probably the only realistic way of doing what you seem to want (why you actually want to do this isn't very obvious, but that is another question). I am not aware of any other way to declare a "variable" length local memory array, because the size of a local memory array must be known at compile time: it is statically reserved as part of the per-thread stack frame.
Of course, instantiating and selecting between many different versions of the same template function isn't much fun, but you can use something like Boost.Preprocessor to automate the tedium.
For example, consider the following simple kernel which looks a lot like the model you describe in your question:
#include <boost/preprocessor/arithmetic/inc.hpp>
#include <boost/preprocessor/comparison/not_equal.hpp>
#include <boost/preprocessor/repetition/for.hpp>
#include <boost/preprocessor/tuple/elem.hpp>

// N fixes the size of the per-thread scratch array at compile time
template<int N>
__global__ void kernel(int *out, int Nout)
{
    int scratch[N];
    for(int i=0; i<N; i++)
        scratch[i] = i - Nout;

    if (Nout > 1) {
        out[threadIdx.x] = scratch[Nout];
    }
}

// Loop state is a (current, last) tuple: keep going while current != last + 1
#define PRED(r, state) \
    BOOST_PP_NOT_EQUAL( \
        BOOST_PP_TUPLE_ELEM(2, 0, state), \
        BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 1, state)) \
    ) \
    /**/

// Advance the state to (current + 1, last)
#define OP(r, state) \
    ( \
        BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 0, state)), \
        BOOST_PP_TUPLE_ELEM(2, 1, state) \
    ) \
    /**/

// Explicit instantiation of kernel<n>
#define STUB(n) template __global__ void kernel<n>(int *, int);
#define MACRO(r, state) STUB(BOOST_PP_TUPLE_ELEM(2, 0, state));

BOOST_PP_FOR((10, 20), PRED, OP, MACRO) // generate kernel<10> ... kernel<20>
Here I have used BOOST_PP_FOR to generate 11 different instances of the basic kernel (kernel<10> through kernel<20>) automagically:
>nvcc -arch=sm_21 -cubin -Xptxas="-v" -I ..\boost_1_60_0 template.cu
template.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6kernelILi13EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi13EEvPii
56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelILi17EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi17EEvPii
72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelILi15EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi15EEvPii
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelILi19EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi19EEvPii
80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelILi11EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi11EEvPii
48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelILi16EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi16EEvPii
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelILi20EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi20EEvPii
80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelILi12EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi12EEvPii
48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelILi14EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi14EEvPii
56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelILi18EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi18EEvPii
72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelILi10EEvPii' for 'sm_21'
ptxas info : Function properties for _Z6kernelILi10EEvPii
40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 44 bytes cmem[0]
You can also use the same automation to generate a host wrapper function which selects the correct instance at runtime. While it isn't ideal, it is portable, fast and works fine with the CUDA toolchain.
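As a rough sketch of what such a wrapper could look like, the following uses BOOST_PP_REPEAT_FROM_TO to generate one switch case per kernel instance. The names launch_kernel and LAUNCH_CASE, the single-block launch configuration, and the bool return convention are my own assumptions for illustration, not something from the question; adapt them to your real launch parameters.

```cuda
#include <cuda_runtime.h>
#include <boost/preprocessor/repetition/repeat_from_to.hpp>

// Same kernel as above: N fixes the size of the per-thread scratch array.
template<int N>
__global__ void kernel(int *out, int Nout)
{
    int scratch[N];
    for (int i = 0; i < N; i++)
        scratch[i] = i - Nout;

    if (Nout > 1) {
        out[threadIdx.x] = scratch[Nout];
    }
}

// Hypothetical helper macro: one switch case per array size n.
// Using kernel<n> in a launch instantiates it implicitly, so no separate
// explicit instantiation is needed in this translation unit.
#define LAUNCH_CASE(z, n, unused) \
    case n: kernel<n><<<1, nthreads>>>(out, Nout); break;

// Hypothetical host wrapper: selects kernel<n> at runtime for n in 10..20.
// Returns false if no instance exists for n or if the launch reports an error.
bool launch_kernel(int n, int *out, int Nout, int nthreads)
{
    switch (n) {
        // Expands to LAUNCH_CASE(z, 10, ~) ... LAUNCH_CASE(z, 20, ~);
        // the upper bound 21 is exclusive.
        BOOST_PP_REPEAT_FROM_TO(10, 21, LAUNCH_CASE, ~)
        default:
            return false;
    }
    return cudaGetLastError() == cudaSuccess;
}
```

With this in place, a call such as launch_kernel(15, d_out, 12, 32) would dispatch to kernel<15>, while a size outside 10 to 20 returns false without launching anything. As with the kernel itself, Nout has to be smaller than the selected N so that scratch[Nout] stays in bounds.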