TL;DR
Use `__constant__ double x[3];` and initialize it from the host, as sketched below.
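A minimal sketch (`host_x` and `initX` are illustrative names; the kernel that reads `x` is omitted):

```
#include <cuda_runtime.h>

// Lives in the constant memory space; cached on-chip per multiprocessor.
__constant__ double x[3];

// Host-side initialization: __constant__ variables cannot be assigned
// from host code directly; copy to the symbol instead.
void initX()
{
    const double host_x[3] = {1.0, 2.0, 3.0};
    cudaMemcpyToSymbol(x, host_x, sizeof(host_x));
}
```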
First, know where a variable actually resides
In your question:
> - thread-local memory: `double x[3] = {1,2,3};`

This is imprecise. Depending on how your code accesses `x[]`, it can reside in either registers or local memory.
Since there are no type qualifiers, the compiler will try its best to put things in registers:

> An automatic variable declared in device code without any of the `__device__`, `__shared__` and `__constant__` qualifiers described in this section generally resides in a register.

but when it can't, it will put them in local memory (see the sketch after this list):

> However in some cases the compiler might choose to place it in local memory,
>
> - Arrays for which it cannot determine that they are indexed with constant quantities,
> - Large structures or arrays that would consume too much register space,
> - Any variable if the kernel uses more registers than available (this is also known as register spilling).
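For instance (a sketch of the first case; the kernel and names are illustrative), an array indexed only with constant quantities can stay in registers, while one indexed with a run-time value is likely forced into local memory:

```
__global__ void indexing(const int *idx, double *out)
{
    // Constant indices only: the compiler can keep a[] in registers.
    double a[3] = {1.0, 2.0, 3.0};
    double r = a[0] + a[1] + a[2];

    // b[] is indexed with a value known only at run time, so the
    // compiler will likely place it in (slow, off-chip) local memory.
    double b[3] = {4.0, 5.0, 6.0};
    int i = idx[threadIdx.x] % 3;
    out[threadIdx.x] = r + b[i];
}
```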
You really don't want `x` to be in local memory; it's slow. In your situation,

> an array with several fixed numerical values that would be accessed multiple times by multiple threads within the same block

both `__constant__` and `__shared__` can be a good choice.
For a complete description of this topic, check: CUDA Toolkit Documentation: variable-type-qualifiers
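For comparison, the `__shared__` variant would look roughly like this (a sketch; `kernel`, `s_x`, and the final expression are illustrative):

```
__global__ void kernel(double *out)
{
    // Staged once per block into fast on-chip shared memory.
    __shared__ double s_x[3];
    if (threadIdx.x == 0) {
        s_x[0] = 1.0; s_x[1] = 2.0; s_x[2] = 3.0;
    }
    __syncthreads();  // make the values visible to all threads in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = s_x[0] + s_x[1] * i + s_x[2];  // arbitrary use of the values
}
```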
Then, consider speed & availability
Hardcode
The numbers will be embedded in the instructions as immediate values, so you may expect some performance improvement. Benchmark your program before and after making this change.
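A sketch of what hardcoding looks like (`kernel` and the expression are illustrative):

```
__global__ void kernel(double *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // 1.0, 2.0 and 3.0 are compiled into the instruction stream as
    // immediates; no memory access is needed to read them.
    out[i] = 1.0 + 2.0 * i + 3.0;
}
```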
Register
It's fast, but scarce. Consider a block of 16x16 threads: with a maximum of 64K registers per block, each thread can use up to 256 registers. (Well, maybe not that scarce; that should be enough for most kernels.)
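To keep a small array in registers, index it only with compile-time constants or fully unrolled loops, e.g. (a sketch; `kernel` and the reduction are illustrative):

```
__global__ void kernel(double *out)
{
    // Small and indexed only with compile-time-constant indices
    // (the loop is fully unrolled), so x[] can live in registers.
    double x[3] = {1.0, 2.0, 3.0};

    double sum = 0.0;
    #pragma unroll
    for (int k = 0; k < 3; ++k)
        sum += x[k];

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}
```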
Local Memory
It's slow. However, a thread can use up to 512KB of local memory.

> The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses...
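If you suspect spilling, compile with `nvcc -Xptxas -v`; `ptxas` then prints each kernel's register count and local-memory (spill) usage, so you can see whether an array actually ended up in local memory.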
Shared Memory
It's fast, but scarce. Typically 48KB per block (less than the register file!).

> Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.
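If the size isn't known at compile time, shared memory can also be allocated dynamically at launch (a sketch; `scale`, `buf`, and the doubling are illustrative):

```
__global__ void scale(const double *in, double *out, int n)
{
    extern __shared__ double buf[];  // sized by the third launch parameter
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[threadIdx.x] = in[i];    // stage through on-chip memory
        out[i] = 2.0 * buf[threadIdx.x];
    }
}

// Launch: the third <<<...>>> argument is the dynamic shared-memory
// size in bytes, e.g.
//   scale<<<blocks, threads, threads * sizeof(double)>>>(d_in, d_out, n);
```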
Constant Memory
It's fast in a different way (see below): its speed depends heavily on the constant cache, and the cache is scarce. Typically 8KB to 10KB of cache per multiprocessor.

> The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.
>
> A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
>
> The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
read: CUDA Toolkit Documentation: device-memory-accesses
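The practical consequence of that splitting rule: constant memory is fastest when all threads of a warp read the same address (a single broadcast), and slows down as they read more distinct addresses. A sketch contrasting the two patterns (assumes the `x` declaration from the TL;DR; `access`, `fast`, and `slower` are illustrative names):

```
__constant__ double x[3];

__global__ void access(double *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Uniform access: every thread in the warp reads the same address,
    // so it's a single request served at constant-cache throughput.
    double fast = x[1];

    // Divergent access: threads read different addresses, so the request
    // is split into one request per distinct address.
    double slower = x[i % 3];

    out[i] = fast + slower;
}
```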