13

Which is the best way of using constants in CUDA?

One way is to define constants in constant memory, like:

// CUDA global constants
__constant__ int M;

int main(void)
{
    ...
    cudaMemcpyToSymbol("M", &M, sizeof(M));
    ...
}

An alternative way would be to use the C preprocessor:

#define M ...

I would think that constants defined with the C preprocessor are much faster. What, then, are the benefits of using constant memory on a CUDA device?

jrsm
  • Constants that are known at compile time should be defined using preprocessor macros (i.e. `#define`). In other cases, `__constant__` [variables](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#constant) may be one option the CUDA programmer uses to optimize code which accesses computed variables that do not change. Note that your usage of `"M"` to reference a symbol is no longer valid in CUDA 5. – Robert Crovella Apr 20 '13 at 12:43
  • It would be interesting to know how large the runtime difference between these two possibilities is. I am currently working on some CFD code and would like to pass the parameters as options to the program, which would require the first approach. On the other hand, if I use preprocessor macros this would not be possible. – jrsm Apr 20 '13 at 14:13
  • Since your second example generates no machine code of any type, it's not a sensible question. You need to pose an actual runtime usage scenario in order to make any sense of that question. For the initial load of a single scalar immediate value into a variable or register, the second method will always be quicker. – Robert Crovella Apr 20 '13 at 14:14
  • If you want to pass parameters that are computed at run-time and then provided to a cuda kernel, the second method could not possibly work, so what is the point of the question? When you pass ordinary parameters to a cuda kernel via the function parameter list, they are passed under the hood via the `__constant__` mechanism/memory anyway, for cc 2.0 devices and beyond (a sketch follows these comments). – Robert Crovella Apr 20 '13 at 14:19
  • I am sorry, I think you misunderstood me: the parameters don't get changed at runtime, they are just defined at the beginning of the program. – jrsm Apr 20 '13 at 14:37
  • The program currently works with the second approach, but since I want to change parameters from simulation to simulation (not at runtime), I want to use command-line options so I don't need to recompile the program each time. Therefore I would use the first approach if it's not slower. – jrsm Apr 20 '13 at 14:47
  • Reading a command line, and then making a run-time decision about which parameter to pass (based on the command line), is effectively computing that parameter at run-time. If the options you are modifying are perhaps small in number, you could also consider using [templated code/kernels](http://stackoverflow.com/questions/6179295/if-statement-inside-a-cuda-kernel/6179580#6179580). – Robert Crovella Apr 20 '13 at 15:02
  • @RobertCrovella I think you provided enough material answering this question. Why don't you post a full answer? I will upvote it. – Vitality Nov 07 '13 at 06:18
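
To make the comment about kernel parameters concrete, here is a minimal, hedged sketch (all names are invented for illustration) of passing a run-time value through the kernel parameter list; on cc 2.0+ devices such arguments are delivered via constant memory under the hood:

#include <cstdlib>

__global__ void axpy(float a, const float *x, float *y, int n)
{
    // 'a' arrives through the kernel parameter list; on cc 2.0+ devices
    // the runtime delivers kernel arguments via constant memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
    const int n = 1024;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Value decided at run time, e.g. parsed from the command line;
    // no recompilation is needed to change it.
    float a = (argc > 1) ? (float)atof(argv[1]) : 2.0f;
    axpy<<<(n + 255) / 256, 256>>>(a, x, y, n);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}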

2 Answers

19
  1. Constants that are known at compile time should be defined using preprocessor macros (e.g. `#define`) or via C/C++ `const` variables at global/file scope.
  2. Usage of `__constant__` memory may be beneficial for programs that use certain values which don't change for the duration of the kernel, and for which certain access patterns are present (e.g. all threads access the same value at the same time). This is not better or faster than constants that satisfy the requirements of item 1 above.
  3. If the number of choices to be made by a program is relatively small, and these choices affect kernel execution, one possible approach for additional compile-time optimization is to use [templated code/kernels](http://stackoverflow.com/questions/6179295/if-statement-inside-a-cuda-kernel/6179580#6179580). A hedged sketch of items 2 and 3 follows this list.
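
As a minimal illustration of items 2 and 3 (a sketch only; the symbol, kernel, and variable names here are invented, not taken from any particular codebase):

// Item 2: a value set once from the host before launch, read uniformly
// by every thread. Since CUDA 5, cudaMemcpyToSymbol takes the symbol
// itself rather than a string naming it.
__constant__ float scale;

__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= scale;  // all threads read the same __constant__ value
}

// Item 3: a choice baked in at compile time via a template parameter,
// letting the compiler optimize each variant separately.
template <bool Doubled>
__global__ void addKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += Doubled ? 2.0f : 1.0f;  // resolved at compile time
}

int main()
{
    const int n = 256;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    float h_scale = 3.0f;  // computed at run time
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(h_scale));
    scaleKernel<<<1, n>>>(d_data, n);

    bool doubled = true;  // e.g. chosen from the command line
    if (doubled)
        addKernel<true><<<1, n>>>(d_data, n);
    else
        addKernel<false><<<1, n>>>(d_data, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}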
Robert Crovella
  • Use of C/C++ style constants, preprocessor macros, or C++ templates may be faster than using `__constant__` memory for multiple reasons: 1. the compiler can apply additional optimizations, and 2. the constant can be embedded in the instruction as an immediate. Accesses to `__constant__` memory can miss the constant cache, adding additional latency. – Greg Smith Jun 10 '14 at 06:07
  • Won't `#define`d constants of non-trivial types be slow because of calls to constructors in each thread each time? – Serge Rogatch Sep 12 '16 at 08:15
  • Is there any performance difference between using macros and `const`s for primitive types (like `int` or `float`)? – Kolay.Ne Mar 14 '20 at 08:38
  • When defining a "const int x" in a kernel function, where is it located? In a register? – Silicomancer Jan 27 '22 at 10:37
  • According to the first comment "2. the constant can be embedded in the instruction as an immediate." That doesn't always happen, but it is one possibility. An exact answer can't be given without an actual complete code. Then, you can answer it yourself with some study of [CUDA SASS and binary utilities](https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html). – Robert Crovella Jan 27 '22 at 15:08
8

Regular C/C++ style constants: In CUDA C (itself a modification of C99), constants are absolute compile-time entities. This is hardly surprising given how heavily NVCC optimizes, which the nature of GPU processing demands.

#define: macros are, as always, very inelegant but useful in a pinch.

The __constant__ variable specifier is, however, a completely new animal and something of a misnomer in my opinion. Here is what Nvidia's documentation says:

The __constant__ qualifier, optionally used together with __device__, declares a variable that:

  • Resides in constant memory space,
  • Has the lifetime of an application,
  • Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).

Nvidia's documentation specifies that __constant__ is available at register level speed (near-zero latency) provided it is the same constant being accessed by all threads of a warp.

They are declared at global scope in CUDA code. However, based on personal (and currently ongoing) experience, you have to be careful with this specifier when it comes to separate compilation, e.g. when separating your CUDA code (.cu and .cuh files) from your C/C++ code by putting wrapper functions in C-style headers.

Unlike traditional "constant"-qualified variables, however, these are initialized at runtime from the host code that allocates device memory and ultimately launches the kernel. I am currently working on code that demonstrates they can be set at runtime using cudaMemcpyToSymbol() before kernel execution.

They are quite handy, to say the least, given the constant-cache speed available for such accesses. A minimal sketch follows.
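
Here is a minimal sketch of that runtime initialization, assuming a hypothetical pair of simulation parameters (the names are invented for illustration):

#include <cstdio>

// Hypothetical simulation parameters resident in constant memory.
__constant__ double params[2];

__global__ void showParams()
{
    // Every thread reads the same address, so the access is broadcast
    // from the constant cache at near-register speed.
    if (threadIdx.x == 0)
        printf("dt = %f, viscosity = %f\n", params[0], params[1]);
}

int main()
{
    // Values decided at run time (e.g. parsed from the command line),
    // copied to the __constant__ symbol before any kernel launch.
    double h_params[2] = {0.01, 1.5e-5};
    cudaMemcpyToSymbol(params, h_params, sizeof(h_params));

    showParams<<<1, 32>>>();
    cudaDeviceSynchronize();

    // The symbol can also be read back on the host:
    double check[2];
    cudaMemcpyFromSymbol(check, params, sizeof(check));
    printf("host read-back: dt = %f\n", check[0]);
    return 0;
}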

opetrenko