I've been struggling for some time with a problem I can't seem to find a solution to.
The problem is that when I try to debug my CUDA code using Nvidia Nsight under Visual Studio 2008 I get strange results when using shared memory.
My code is:
template<typename T>
__device__
T integrate()
{
    extern __shared__ T s_test[]; // Dynamically allocated shared memory
    /**** Breakpoint (1) here ****/
    int index = threadIdx.x + threadIdx.y * blockDim.x; // Local index in block. Column major ordering
    if(index < 64 && blockIdx.x==0) { // Only work on a few values. Just testing
        s_test[index] = (T)index;
        /* Some other irrelevant code here */
    }
    return v;
}
When I reach Breakpoint (1) and inspect the shared memory in the Visual Studio Watch window, only the first 8 values of the array change and the others remain null. I would expect all of the first 64 to change.
I thought it might have something to do with not all warps executing simultaneously, so I tried synchronizing them. I added this code inside integrate():
template<typename T>
__device__
T integrate()
{
    /* Old code is still here */
    __syncthreads();
    /**** Breakpoint (2) here ****/
    if(index < 64 && blockIdx.x==0) {
        T tmp = s_test[index]; // Write to tmp variable so I can inspect it inside Nsight Watch window
        v = tmp + index; // Use `tmp` and `index` somehow so that the compiler doesn't optimize it out of existence
    }
    return v;
}
But the problem is still there. Furthermore, the rest of the values inside tmp are not 0, even though the VS Watch window shows them as 0.
I must mention that it takes a lot of steps to step over __syncthreads(), so when I reach it I just jump to Breakpoint (2). What is going on?
EDIT: Information about the system/launch configuration
System
- Name Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66GHz
- Architecture x86
- Frequency 2,666 MHz
- Number of Cores 2
- Page Size 4,096
- Total Physical Memory 3,582.00 MB
- Available Physical Memory 1,983.00 MB
- Version Name Windows 7 Ultimate
- Version Number 6.1.7600
Device GeForce 9500 GT
- Driver Version 301.42
- Driver Model WDDM
- CUDA Device Index 0
- GPU Family G96
- Compute Capability 1.1
- Number of SMs 4
- Frame Buffer Physical Size (MB) 512
- Frame Buffer Bandwidth (GB/s) 16
- Frame Buffer Bus Width (bits) 128
- Frame Buffer Location Dedicated
- Graphics Clock (Mhz) 812
- Memory Clock (Mhz) 500
- Processor Clock (Mhz) 1625
- RAM Type DDR2
IDE
- Microsoft Visual Studio Team System 2008
- NVIDIA Nsight Visual Studio Edition, Version 2.2 Build No. 2.2.0.12255
Compiler commands
1> "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\\bin\nvcc.exe" -G -gencode=arch=compute_10,code=\"sm_10,compute_10\" --machine 32 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -D_NEXUS_DEBUG -g -D_DEBUG -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -I"inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\\include" -maxrregcount=0 --compile -o "Debug/process_f2f.cu.obj" process_f2f.cu
Launch configuration. The shared memory size doesn't seem to matter. I've tried several versions. The one I've worked with the most is:
- Shared memory 2048 Bytes
- Grid/block sizes : {101, 101, 1} , {16, 16, 1}
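
For reference, the launch configuration above can be sketched as a small stand-alone test that copies the shared values out to global memory, so they can be checked on the host independently of what the Watch window shows. This is only a sketch under assumptions: the kernel name fill_shared, the helper run_shared_test, the fixed float element type, and the extra blockIdx.y == 0 check (so that only one block writes the output) are hypothetical and not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the first 64 threads of block (0, 0) write their local
// index into dynamically allocated shared memory, then copy it to global memory.
__global__ void fill_shared(float* out)
{
    extern __shared__ float s_test[];  // size comes from the 3rd launch parameter

    int index = threadIdx.x + threadIdx.y * blockDim.x;  // local index in block
    bool active = (index < 64 && blockIdx.x == 0 && blockIdx.y == 0);

    if (active)
        s_test[index] = (float)index;

    __syncthreads();  // outside the branch, so every thread in the block reaches it

    if (active)
        out[index] = s_test[index];
}

// Hypothetical host helper: returns the number of mismatching values,
// or -1 if any CUDA call fails.
int run_shared_test()
{
    const int n = 64;
    float h_out[n];
    float* d_out = 0;

    if (cudaMalloc((void**)&d_out, n * sizeof(float)) != cudaSuccess)
        return -1;

    dim3 grid(101, 101, 1);      // grid size from the launch configuration
    dim3 block(16, 16, 1);       // 256 threads per block
    size_t shared_bytes = 2048;  // dynamic shared memory per block

    fill_shared<<<grid, block, shared_bytes>>>(d_out);

    cudaError_t err = cudaGetLastError();           // launch configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();              // errors during execution
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != (float)i)
            ++mismatches;
    return mismatches;
}
```

Calling run_shared_test() from main and printing the result gives a debugger-independent check: if it reports no mismatches, all 64 values were written to shared memory as expected, which would point to the Watch window display rather than the kernel.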