
I have some code which uses local memory (I might have used registers, but I need dynamic addressing). Since the amount of memory I use depends on the input and on the number of threads in the block (which also depends on the input at run-time, though before launch-time), it can't be a fixed-size array. On the other hand, I can't write

__global__ void foo(short x)
{
    int my_local_mem_array[x]; // x is not a compile-time constant
}

(which is valid, if problematic, C99, but not valid C++ even on the host side.)

How can I achieve the same effect? So far my thought has been to template the kernel on the memory size and just invoke it with the maximum possible proper-L1 memory on a core, using only as much as I need. But that's kind of ugly, since it would mean multiplying the number of instantiations by the number of different possible maximum memory sizes. Ugh.

einpoklum
  • What is the range of possible `x` values you would need to have? – talonmies Mar 05 '16 at 10:40
  • @talonmies: Depends on what I can realistically expect to keep in L1 cache. So I guess up to 48K / block size on a good day. – einpoklum Mar 05 '16 at 12:07
  • Will `x` * block size be small enough to fit into a dynamically-sized `__shared__` array? – Jared Hoberock Mar 06 '16 at 07:36
  • @JaredHoberock: Yes. – einpoklum Mar 06 '16 at 11:28
  • @einpoklum: `__shared__` memory sounds like one way to get everything you want: 1. on-chip storage, 2. dynamic allocation, & 3. dynamic indexing. You could even dynamically allocate it within the kernel if you use something like [this shmalloc](https://github.com/jaredhoberock/shmalloc) code. (A sketch of this approach follows these comments.) – Jared Hoberock Mar 06 '16 at 17:33
  • @JaredHoberock: I've actually reached that conclusion at about the time I was formulating this question, but I'm still interested in how this can be done. Also, on Kepler, you have an extra 16KB of L1 besides the shared memory. – einpoklum Mar 06 '16 at 17:48
  • @einpoklum: You need something like `alloca`, but nothing like that exists in CUDA or even ISO C++ as far as I know. Maybe there's a way to manipulate the stack pointer using inline PTX (I doubt there is). Or maybe there's some trick that would let you indirectly manipulate the stack by using a C-style varargs function with ellipsis? – Jared Hoberock Mar 06 '16 at 21:24
  • @JaredHoberock: I'll put that down as "not humanly possible" for now, thank you :-) – einpoklum Mar 06 '16 at 21:50
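
For concreteness, here is a minimal sketch of the dynamically-sized `__shared__` approach the comments converge on. The kernel name `foo` and the launch configuration are illustrative, and it assumes `x` * block size * `sizeof(int)` fits within the shared memory limit:

#include <cuda_runtime.h>

// Each thread addresses its own slice of a dynamically-sized shared array.
__global__ void foo(short x)
{
    extern __shared__ int smem[];          // size is set at launch time, not compile time
    int *my_mem = smem + threadIdx.x * x;  // this thread's slice of x ints

    for (int i = 0; i < x; i++)            // dynamic addressing works here
        my_mem[i] = i;
}

int main()
{
    short x = 32;
    int block_size = 128;
    size_t smem_bytes = size_t(x) * block_size * sizeof(int);
    foo<<<1, block_size, smem_bytes>>>(x); // third launch argument sets the array size
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}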

2 Answers


I think template metaprogramming is probably the only realistic way of doing what it seems you want (the rationale for why you actually want to do this isn't very obvious, but that is another question). I am not aware of any other way to declare a "variable" length local memory array, given that local memory must be statically compiled into the per-thread stack frame.

Of course, instantiating and selecting among many different versions of the same template function isn't much fun, but you can use something like the Boost Preprocessor library to automate all the tedium.

For example, consider the following simple kernel which looks a lot like the model you describe in your question:

#include <boost/preprocessor/arithmetic/inc.hpp>
#include <boost/preprocessor/comparison/not_equal.hpp>
#include <boost/preprocessor/repetition/for.hpp>
#include <boost/preprocessor/tuple/elem.hpp>

template<int N>
__global__ void kernel(int *out, int Nout)
{
    int scratch[N];   // N is a compile-time constant in each instantiation
    for(int i=0; i<N; i++)
        scratch[i] = i - Nout;

    if (Nout > 1) {
       out[threadIdx.x] = scratch[Nout];   // runtime index forces local memory
    }
}

// Loop predicate: continue while the counter has not passed the upper bound
#define PRED(r, state) \
   BOOST_PP_NOT_EQUAL( \
      BOOST_PP_TUPLE_ELEM(2, 0, state), \
      BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 1, state)) \
   ) \
   /**/

// Loop operation: increment the counter, keep the upper bound
#define OP(r, state) \
   ( \
      BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 0, state)), \
      BOOST_PP_TUPLE_ELEM(2, 1, state) \
   ) \
   /**/

// Emit an explicit instantiation of the kernel for a given N
#define STUB(n) template __global__ void kernel<n>(int *, int);
#define MACRO(r, state) STUB(BOOST_PP_TUPLE_ELEM(2, 0, state));

BOOST_PP_FOR((10, 20), PRED, OP, MACRO) // generate kernel<10> ... kernel<20>

Here I have used BOOST_PP_FOR to generate the eleven instantiations of the basic kernel (kernel<10> through kernel<20>) automagically:

>nvcc -arch=sm_21 -cubin -Xptxas="-v" -I ..\boost_1_60_0 template.cu

template.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelILi13EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi13EEvPii
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi17EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi17EEvPii
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi15EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi15EEvPii
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi19EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi19EEvPii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi11EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi11EEvPii
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi16EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi16EEvPii
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi20EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi20EEvPii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi12EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi12EEvPii
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi14EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi14EEvPii
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi18EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi18EEvPii
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi10EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi10EEvPii
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]

You can also use the same automation to generate a host wrapper function which selects the correct instance at runtime. While it isn't ideal, it is portable, fast and works fine with the CUDA toolchain.
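
For example, a hand-written sketch of such a dispatcher; the function name `launch_kernel` is made up, and the same BOOST_PP_FOR machinery could emit the cases mechanically instead:

#include <cassert>

// Hypothetical host-side wrapper selecting the matching instantiation at runtime.
void launch_kernel(int n, int *out, int Nout, dim3 grid, dim3 block)
{
    switch (n) {
        case 10: kernel<10><<<grid, block>>>(out, Nout); break;
        case 11: kernel<11><<<grid, block>>>(out, Nout); break;
        // ... cases 12 through 19 follow the same pattern ...
        case 20: kernel<20><<<grid, block>>>(out, Nout); break;
        default: assert(!"no kernel instantiation for this size");
    }
}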

talonmies

Just use global memory allocated by the host using `cudaMalloc`. Any array access that is not fully defined at compile time will cause CUDA to use "local memory", which, despite the name, physically resides in global memory. Alternatively you can use `new` or in-kernel `malloc`, as sketched below.
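
A minimal sketch of the in-kernel allocation variant; it requires compute capability 2.0 or later, and since the default device heap is small, raising it with `cudaDeviceSetLimit` beforehand (the 8 MB size here is an arbitrary assumption) may be necessary:

#include <cuda_runtime.h>

__global__ void foo(short x)
{
    // Per-thread allocation from the device heap (physically global memory)
    int *my_mem = static_cast<int *>(malloc(x * sizeof(int)));
    if (my_mem != nullptr) {
        for (int i = 0; i < x; i++)
            my_mem[i] = i;
        free(my_mem);  // must be freed on the device side
    }
}

int main()
{
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024);
    foo<<<4, 128>>>(100);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}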

Joe
  • I don't understand your second sentence. Compiler spilling to local memory only happens for statically sized, local scope variables. How would that mechanism have anything to do with a generic global memory pointer passed as an argument to the kernel? – talonmies Mar 05 '16 at 10:43
  • Local memory is only an abstraction. The allocation is physically in global memory. – Joe Mar 05 '16 at 10:45
  • @Joe: It's not _only_ an abstraction. Caching behaves differently; and managing it is different. Also, how would malloc'ing an array in advance help me? I'd have to allocate the maximum necessary anyway; might as well do that with static sizing. – einpoklum Mar 05 '16 at 12:34
  • On 1.x it is only an abstraction. Kepler reserves L1 for locals. But Maxwell now puts locals in L2 only like globals. And yes, you'd have to allocate the maximum. But as I point out in the last sentence you can also allocate global memory from within the kernel on 2.x hardware. Just tends to be on the slow side. – Joe Mar 05 '16 at 12:58
  • Local memory is *not* just an abstraction. Yes it is global memory, but the compilation behaviour is not the same. Non-warp divergent access to local memory is automatically coalesced, for example. – talonmies Mar 05 '16 at 14:27