16

I have many unused registers in my kernel. I'd like to tell CUDA to use a few registers to hold some data, rather than doing a global data read every time I need it. (I'm not able to use shared mem.)

__global__ void simple(float *gData) {
    float rData[1024];
    for (int i = 0; i < 1024; i++) {
        rData[i] = gData[i];
    }
    // work on the data here
}

compile w/: nvcc -arch sm_20 --ptxas-options=-v simple.cu, and I get
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Used 2 registers, 40 bytes cmem[0]

__global__ void simple(float *gData) {
    register float rData[1024];
    for (int i = 0; i < 1024; i++) {
        rData[i] = gData[i];
    }
    // work on the data here
}

The register declaration does nothing:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Used 2 registers, 40 bytes cmem[0]

__global__ void simple(float *gData) {
    volatile float rData[1024];
    for (int i = 0; i < 1024; i++) {
        rData[i] = gData[i];
    }
    // work on the data here
}

The volatile declaration creates stack storage:
4096 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Used 21 registers, 40 bytes cmem[0]

1) Is there an easy way to tell the compiler to use register space for a variable?
2) Where is 'stack frame': register, global mem, local mem,...? What is a stack frame? (Since when does the GPU have a stack? A virtual stack?)
3) The simple.ptx file is basically empty: (nvcc -arch sm_20 -ptx simple.cu)

.loc 2 14 2
ret;

Any idea where I can find the real machine/compiled code?

Doug
  • 2,783
  • 6
  • 33
  • 37
  • 4
    The compiler optimized the entire code away, because it doesn't modify any non-transient state. – njuffa Aug 28 '12 at 21:57
  • 2
    Asking for 1024 registers per thread is a pretty tall order. Most kernels require ~dozens of registers per thread. If you want to be absolutely sure the compiler can use a register for a variable, it needs to be a scalar (i.e., not an array you index in a `for` loop). – Jared Hoberock Aug 28 '12 at 22:59
  • The where/what stack frame answer can be found here: http://stackoverflow.com/questions/7810740/where-does-cuda-allocate-the-stack-frame-for-kernels – Doug Aug 29 '12 at 15:08
  • float a1,a2,a3,a4,a5; // each 'a' gets a reg. volatile float b1,b2,b3,b4,b5; // each 'b' is on the stack (local mem). The 'volatile' declaration does nothing for reg allocation, but it does create a local-mem stack. – Doug Aug 29 '12 at 17:35

2 Answers

23
  • Dynamically indexed arrays cannot be stored in registers, because the GPU register file is not dynamically addressable.
  • Scalar variables are automatically stored in registers by the compiler.
  • Statically-indexed (i.e. where the index can be determined at compile time), small arrays (say, fewer than 16 floats) may be stored in registers by the compiler; a sketch follows below.
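
For illustration (not from the original post), here is a minimal sketch of the kind of small, statically-indexed array the compiler can keep in registers. The kernel name and sizes are made up; the #pragma unroll makes every index a compile-time constant:

// Hypothetical example: a tiny array whose indices are all known at compile time.
// After full unrolling, rData[0..3] can live entirely in registers.
__global__ void small_static(const float *gData, float *out) {
    float rData[4];
    #pragma unroll
    for (int i = 0; i < 4; i++) {           // trip count fixed at compile time
        rData[i] = gData[threadIdx.x * 4 + i];
    }
    out[threadIdx.x] = rData[0] + rData[1] + rData[2] + rData[3];
}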

SM 2.0 GPUs (Fermi) only support up to 63 registers per thread. If this is exceeded, register values will be spilled/filled from local (off-chip) memory, supported by the cache hierarchy. SM 3.5 GPUs expand this to up to 255 registers per thread.
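
If you want to see or cap per-thread register usage yourself, the standard nvcc options below do that (the cap of 32 is just an illustrative number):

nvcc -arch=sm_20 --ptxas-options=-v simple.cu      # report registers and spill stores/loads
nvcc -arch=sm_20 --maxrregcount=32 simple.cu       # cap registers per thread; excess is spilled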

In general, as Jared mentions, using too many registers per thread is not desirable because it reduces occupancy, and therefore reduces latency-hiding ability in the kernel. GPUs thrive on parallelism and do so by covering memory latency with work from other threads.

Therefore, you should probably not optimize arrays into registers. Instead, ensure that your memory accesses to those arrays across threads are as close to sequential as possible so you maximize coalescing (i.e. minimize memory transactions).
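
As an illustration of what "close to sequential" means (the kernel is hypothetical), consecutive threads should touch consecutive addresses so each warp's loads coalesce into few transactions:

// Hypothetical example of a coalesced access pattern:
// thread k reads element k, so a warp's 32 loads fall into contiguous memory.
__global__ void coalesced(const float *gData, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = 2.0f * gData[idx];
}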

The example you give may be a case for shared memory (sketched after the list) if:

  1. Many threads in the block use the same data, or
  2. The per-thread array size is small enough to allocate enough space for all threads in multiple thread blocks (1024 floats per thread is far too much).
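
A minimal sketch of that shared-memory pattern (block size and names are assumptions; it assumes 256 threads per block): the block stages a tile once and every thread reuses it, instead of each thread re-reading global memory.

// Hypothetical sketch: stage one tile per block in shared memory, then reuse it.
// Assumes the kernel is launched with 256 threads per block.
__global__ void sharedTile(const float *gData, float *out) {
    __shared__ float tile[256];
    tile[threadIdx.x] = gData[blockIdx.x * 256 + threadIdx.x];
    __syncthreads();                        // make the tile visible to the whole block

    float acc = 0.0f;
    for (int i = 0; i < 256; i++)           // every thread reads the shared copy
        acc += tile[i];
    out[blockIdx.x * 256 + threadIdx.x] = acc;
}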

As njuffa mentioned, the reason your kernel only uses 2 registers is because you don't do anything useful with the data in the kernel, and the dead code was all eliminated by the compiler.

harrism
  • 26,505
  • 2
  • 57
  • 88
  • You are suggesting there is a limit to the # of reg's that a thread can use (63 for SM_20). Where does this come from? The device properties show a limit to the # of reg's per BLOCK (regsPerBlock). – Doug Aug 29 '12 at 15:12
  • 3
    It comes from the architecture, and the compiler takes care of ensuring no register number greater than the limit is used in the generated binary code. A user need not worry about this limit other than for performance reasons (to understand the cause of register spilling, for example), which is why there's no need to list it in the deviceProps structure. – harrism Aug 30 '12 at 06:18
  • Using many registers may be desirable because maximizing occupancy is not the only way to hide latency. Another way to hide latency is instruction-level parallelism. Sometimes it is the only way to reach peak performance. Check Vasily Volkov's [slides](http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf), where the author got peak performance at only 8% occupancy. – Kamil Czerski Jul 25 '14 at 14:37
6

As noted already, registers (and the PTX "param space") cannot be indexed dynamically. To do that, the compiler would have to emit code like a switch...case block to turn the dynamic index into an immediate. I'm not sure it ever does this automatically. You can help it along by using a fixed-size tuple structure and a switch...case. C/C++ metaprogramming is likely to be the weapon of choice to keep code like this manageable.
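
A rough sketch of that idea (the struct and member names are invented for illustration): hold the values in scalars and let a switch turn the runtime index into immediate accesses.

// Hypothetical sketch: a fixed-size "tuple" kept in scalars (hence registers),
// with dynamic indexing emulated by a switch over compile-time cases.
struct Reg4 {
    float r0, r1, r2, r3;

    __device__ float get(int i) const {
        switch (i) {
            case 0:  return r0;
            case 1:  return r1;
            case 2:  return r2;
            default: return r3;
        }
    }

    __device__ void set(int i, float v) {
        switch (i) {
            case 0:  r0 = v; break;
            case 1:  r1 = v; break;
            case 2:  r2 = v; break;
            default: r3 = v; break;
        }
    }
};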

Also, for CUDA 4.0 use the command line switch -Xopencc=-O3 in order to have anything but plain scalars (such as data structures) mapped to registers (see this post). For CUDA > 4.0 you have to disable debug support (no -G command line option - optimization happens only when debugging is disabled).
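
For reference, the two compile lines being contrasted would look roughly like this (the source file name is just a placeholder):

nvcc -arch=sm_20 -Xopencc=-O3 simple.cu    # CUDA 4.0: ask the frontend to map aggregates to registers
nvcc -arch=sm_20 simple.cu                 # CUDA > 4.0: simply do not pass -G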

The PTX level allows many more virtual registers than the hardware. Those are mapped to hardware registers at load time. The register limit you specify allows you to set an upper limit on the hardware resources used by the generated binary. It also serves as a heuristic for the compiler to decide when to spill registers (see below) already while compiling to PTX, so that certain concurrency needs can be met (see "launch bounds", "occupancy" and "concurrent kernel execution" in the CUDA documentation - you might also enjoy this most interesting presentation).
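
As an illustration of "launch bounds" (the kernel itself is hypothetical), the qualifier tells the compiler the intended launch configuration so it can budget registers, spilling if necessary:

// Hypothetical example: promise at most 256 threads per block and ask for at
// least 4 resident blocks per SM, so the compiler limits registers accordingly.
__global__ void __launch_bounds__(256, 4)
tuned(const float *gData, float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = gData[idx] * gData[idx];
}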

For Fermi GPUs there are at most 64 hardware registers. The 64th (or the last - when using less than the hardware's maximum) is used by the ABI as the stack pointer and thus for "register spilling" (it means freeing up registers by temporarily storing their values on the stack and happens when more registers are needed than available) so it is untouchable.

Dude
  • 583
  • 2
  • 9
  • 1
    The link about -Xopencc=-O3 is gone and I can't really find any references to it in the context of CUDA. Could you point me to some resource or explain whether the behaviour with recent CUDA (7.0/7.5) is similar? – XapaJIaMnu Sep 14 '15 at 12:35