I have many unused registers in my kernel. I'd like to tell CUDA to keep some data in a few of those registers, rather than reading it from global memory every time I need it. (I'm not able to use shared memory.)
    __global__ void simple(float *gData) {
        float rData[1024];
        for (int i = 0; i < 1024; i++) {
            rData[i] = gData[i];
        }
        // work on the data here
    }
Compiling with nvcc -arch sm_20 --ptxas-options=-v simple.cu, I get:
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    Used 2 registers, 40 bytes cmem[0]
    __global__ void simple(float *gData) {
        register float rData[1024];
        for (int i = 0; i < 1024; i++) {
            rData[i] = gData[i];
        }
        // work on the data here
    }
The register declaration does nothing:
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    Used 2 registers, 40 bytes cmem[0]
    __global__ void simple(float *gData) {
        volatile float rData[1024];
        for (int i = 0; i < 1024; i++) {
            rData[i] = gData[i];
        }
        // work on the data here
    }
The volatile declaration creates stack storage:
    4096 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    Used 21 registers, 40 bytes cmem[0]
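A possible explanation for the first two results, which I have not confirmed: nothing ever reads rData, so the compiler may be removing the whole loop, which would also fit the 2-register count. The sketch below is a variant that keeps the data live by writing a sum back to global memory. The output pointer gOut and the smaller size N = 16 are assumptions that are not in my original kernel (1024 floats per thread cannot fit in registers on sm_20, which allows at most 63 per thread), and whether the array actually ends up in registers still depends on the compiler fully unrolling the loops.

```cuda
// Hypothetical small size, chosen so the array could fit in the
// 63 registers per thread available on sm_20.
#define N 16

// gOut is an assumed output buffer with one float per thread.
__global__ void simple(const float *gData, float *gOut) {
    float rData[N];

    // Full unrolling turns every rData[i] into a compile-time index,
    // which is what lets the compiler map elements to registers.
    #pragma unroll
    for (int i = 0; i < N; i++) {
        rData[i] = gData[i];
    }

    // Use the data so the loads above are not dead code.
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < N; i++) {
        sum += rData[i];
    }

    // Writing the result to global memory makes the work observable.
    gOut[threadIdx.x] = sum;
}
```

Building this with the same --ptxas-options=-v flag should show whether the register count rises and the stack frame stays at 0 bytes.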
1) Is there an easy way to tell the compiler to use register space for a variable?
2) Where does the 'stack frame' live: registers, global memory, local memory, ...? What is a stack frame in this context? (Since when does the GPU have a stack? Is it a virtual stack?)
3) The simple.ptx file (generated with nvcc -arch sm_20 -ptx simple.cu) is basically empty:
    .loc 2 14 2
    ret;
Any idea where I can find the real machine/compiled code?