Fastest (or most elegant) way of passing constant arguments to a CUDA kernel

Question

Let's say I want a CUDA kernel that needs to do lots of stuff, but there are some parameters that are constant across all kernel launches. These arguments are passed to the main program as input, so they cannot be defined with a #define.

The kernel will run multiple times (around 65K) and it needs those parameters (and some other inputs) to do its maths.

My question is: what's the fastest (or else, the most elegant) way of passing these constants to the kernels?

The constants are float* or int* arrays of 2 or 3 elements each. There will be around 5 to 10 of them.

Toy example with 2 constants, const1 and const2:

__global__ void kernelToyExample(int inputdata, ?????){
        float value = inputdata * const1[0] + const2[1] / const1[2];
}

Is it better to use

__global__ void kernelToyExample(int inputdata, float* const1, float* const2){
        float value = inputdata * const1[0] + const2[1] / const1[2];
}

or

__global__ void kernelToyExample(int inputdata, float const1x, float const1y, float const1z, float const2x, float const2y){
        float value = inputdata * const1x + const2y / const1z;
}

or maybe declare them in some global read only memory and let the kernels read from there? If so, L1, L2, global? Which one?

Is there a better way I don't know of?

Running on a Tesla K40.

I read somewhere that you are supposed to pass built-in types by value for optimum efficiency. – The Vivandiere, Jul 22 '15 at 17:02
@StraightLine Interesting. Any sources? I guessed that for variables in general, but what about this bunch that are constant? Can they be put somewhere in a fast access memory and have better performance than actually sending 65K copies of them? – Ander Biguri, Jul 22 '15 at 17:05
Some optimizations depend on the compute capability of your device. You can exploit the 64KB of constant memory. You can also arrange all the constant values in an array and tell the compiler that this memory space will remain constant. – pQB, Jul 22 '15 at 17:11

score 18 · Accepted Answer · edited May 23 '17 at 12:08

Just pass them by value. The compiler will automagically put them in the optimal place to facilitate cached broadcast to all threads in each block - either shared memory in compute capability 1.x devices, or constant memory/constant cache in compute capability >= 2.0 devices.

For example, if you had a long list of arguments to pass to the kernel, a struct passed by value is a clean way to go:

struct arglist {
    float magicfloat_1;
    float magicfloat_2;
    //......
    float magicfloat_19;
    int magicint1;
    //......
};

__global__ void kernel(...., const arglist args)
{
    // you get the idea
}

[standard disclaimer: written in browser, not real code, caveat emptor]
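To make the struct idea concrete, here is a minimal sketch of a complete program that bundles the question's const1 and const2 into one struct and passes it by value. The names KernelParams and toyFormula, the sample values, and the launch configuration are all assumptions made for illustration; none of them come from the question. Only the input and output arrays need device allocations, the struct itself travels with the launch.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical parameter bundle (names are illustrative, not from the question).
struct KernelParams {
    float const1[3];
    float const2[2];
    int   n;          // number of elements to process
};

// Shared by host and device so the GPU result can be checked on the CPU.
__host__ __device__ inline float toyFormula(float x, const KernelParams& p)
{
    return x * p.const1[0] + p.const2[1] / p.const1[2];
}

// The struct is a by-value kernel argument: no cudaMalloc/cudaMemcpy for it.
__global__ void kernelToyExample(const float* in, float* out, const KernelParams p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n) {
        out[i] = toyFormula(in[i], p);
    }
}

int main()
{
    const int n = 1024;
    KernelParams p = {{2.0f, 0.0f, 4.0f}, {0.0f, 8.0f}, n};  // sample values only

    std::vector<float> hIn(n), hOut(n);
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    kernelToyExample<<<(n + 255) / 256, 256>>>(dIn, dOut, p);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare every element against the host evaluation of the same formula.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (std::fabs(hOut[i] - toyFormula(hIn[i], p)) > 1e-5f) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(dIn);
    cudaFree(dOut);
    return mismatches == 0 ? 0 : 1;
}
```

One thing to keep in mind with this approach is that the total size of kernel arguments is limited (4 KB on devices like the K40), which is far more than a handful of 2 or 3 element arrays will need.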

If it turned out one of your magicint actually only took one of a small number of values which you know beforehand, then templating is an extremely powerful tool:

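template<int magiconstant1>
__global__ void kernel(....)
{
    for(int i=0; i < magiconstant1; ++i) {
       // .....
    }
}

// explicit instantiations, one per value you expect at run time
template __global__ void kernel<3>(....);
template __global__ void kernel<4>(....);
template __global__ void kernel<5>(....);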

The compiler is smart enough to recognise that magiconstant1 makes the loop trip count known at compile time, and it will automatically unroll the loop for you. Templating is a very powerful technique for building fast, flexible codebases, and you would be well advised to familiarise yourself with it if you haven't already done so.
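Since the value only arrives at run time, something on the host has to pick the matching instantiation. A rough sketch of one way to do that is a switch over the supported values. The kernel body, the name sumTaps, and the helper launchSumTaps are hypothetical and exist only to show the dispatch pattern; values without an instantiation are rejected rather than launched.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel: TAPS plays the role of the compile-time constant.
template <int TAPS>
__global__ void sumTaps(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    // The trip count is known at compile time, so the loop can be unrolled.
    for (int k = 0; k < TAPS; ++k) {
        acc += in[i] * (k + 1);
    }
    out[i] = acc;
}

// Hypothetical host helper: maps a run-time value onto a compiled instantiation.
// Returns false when no instantiation exists for the requested value.
bool launchSumTaps(int taps, const float* in, float* out, int n)
{
    dim3 block(256);
    dim3 grid((n + 255) / 256);
    switch (taps) {
        case 3: sumTaps<3><<<grid, block>>>(in, out, n); return true;
        case 4: sumTaps<4><<<grid, block>>>(in, out, n); return true;
        case 5: sumTaps<5><<<grid, block>>>(in, out, n); return true;
        default: return false;
    }
}

int main()
{
    const int n = 512;
    const int taps = 4;  // in a real program this would come from user input

    std::vector<float> hIn(n, 1.0f), hOut(n, 0.0f);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    if (!launchSumTaps(taps, dIn, dOut, n)) {
        printf("no kernel instantiation for taps = %d\n", taps);
        return 1;
    }
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // With every input equal to 1 and taps = 4, each output is 1 + 2 + 3 + 4.
    printf("out[0] = %f\n", hOut[0]);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

The trade-off is binary size and compile time: every case in the switch is a separate kernel, so this only makes sense when the set of values is small.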


answered Jul 22 '15 at 17:10

talonmies


I'm running on a quite decent Tesla K40 GPU. So you feel that it's better to just pass around 20 input parameters to the kernel as `int const1x`, rather than "forcing" the GPU to use a specific memory (if you can even do this)? – Ander Biguri Jul 22 '15 at 17:18
I wouldn't pass forty scalars by value, I would pass them in a struct by value, but yes, the compiler knows best and there is no better way to do it. Actually, there is a better way for integer constants which might have a limited range of values - make them template parameters and instantiate different kernel versions. The compiler will do a lot of useful optimisations when the constants are known at compile time – talonmies Jul 22 '15 at 17:23
WOW thanks! (smart compiler). Could you please (if you don't mind) add a couple of links or code snippets showing how to pass the struct by value or make the templates? Apologies, I'm still learning and everything feels a bit overwhelming. – Ander Biguri Jul 22 '15 at 17:26
@AnderBiguri: I've made an edit with some crude examples that might get you started. That is about the best I can do. – talonmies Jul 22 '15 at 18:56
That's excellent actually. Thank you very much, your effort is greatly appreciated! Keep up the good work. – Ander Biguri Jul 22 '15 at 19:13
I didn't realise that `const` kernel arguments would be put into constant memory. Is this mentioned somewhere in the CUDA documentation, do you know? – hertzsprung Sep 06 '19 at 16:15
