CUDA Unable to see shared memory values in Nsight debugging

Question

I've been struggling for some time with a problem I can't seem to find a solution to.

The problem is that when I try to debug my CUDA code using Nvidia Nsight under Visual Studio 2008 I get strange results when using shared memory.

My code is:

template<typename T>
__device__
T integrate()
{
   extern __shared__ T s_test[]; // Dynamically allocated shared memory
   /**** Breakpoint (1) here ****/
   int index = threadIdx.x + threadIdx.y * blockDim.x; // Local index in block. Column major ordering
   if(index < 64 && blockIdx.x==0) { // Only work on a few values. Just testing
      s_test[index] = (T)index;
      /* Some other irrelevant code here */
   }
   return v;
}

When I reach Breakpoint (1) and inspect the shared memory in the Visual Studio Watch window, only the first 8 values of the array change and the others remain null. I would expect all of the first 64 to change. (Screenshot: Watch window from Visual Studio)

I thought it might have something to do with not all warps executing simultaneously, so I tried synchronizing them. I added this code inside integrate():

template<typename T>
__device__
T integrate()
{
   /* Old code is still here */

   __syncthreads();
   /**** Breakpoint (2) here ****/
   if(index < 64 && blockIdx.x==0) {
      T tmp = s_test[index]; // Write to tmp variable so I can inspect it inside Nsight Watch window
      v = tmp + index; // Use `tmp` and `index` somehow so that the compiler doesn't optimize it out of existence
   }
   return v;
}

But the problem is still there. Furthermore, the rest of the values read into tmp are not 0, contrary to what the Visual Studio Watch window indicates. (Screenshot: Watch window from Nsight)

I must mention that it takes a lot of steps to step over __syncthreads(), so when I reach it I just jump to Breakpoint (2). What is going on?

EDIT: Information about the system and launch configuration

System

Name Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66GHz
Architecture x86
Frequency 2.666 MHz
Number of Cores 2
Page Size 4.096
Total Physical Memory 3.582,00 MB
Available Physical Memory 1.983,00 MB
Version Name Windows 7 Ultimate
Version Number 6.1.7600

Device GeForce 9500 GT

Driver Version 301.42
Driver Model WDDM
CUDA Device Index 0
GPU Family G96
Compute Capability 1.1
Number of SMs 4
Frame Buffer Physical Size (MB) 512
Frame Buffer Bandwidth (GB/s) 16
Frame Buffer Bus Width (bits) 128
Frame Buffer Location Dedicated
Graphics Clock (Mhz) 812
Memory Clock (Mhz) 500
Processor Clock (Mhz) 1625
RAM Type DDR2

IDE

Microsoft Visual Studio Team System 2008
NVIDIA Nsight Visual Studio Edition, Version 2.2 Build No. 2.2.0.12255

Compiler commands

1> "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\\bin\nvcc.exe"  -G  -gencode=arch=compute_10,code=\"sm_10,compute_10\"   --machine 32 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -D_NEXUS_DEBUG -g  -D_DEBUG -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd  " -I"inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\\include" -maxrregcount=0  --compile -o "Debug/process_f2f.cu.obj" process_f2f.cu

Launch configuration. The shared memory size doesn't seem to matter; I've tried several values. The configuration I've worked with the most is:

Shared memory 2048 Bytes
Grid/block sizes : {101, 101, 1} , {16, 16, 1}
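
For reference, a launch with the configuration listed above could look roughly like the sketch below. The kernel name `process_kernel` and the `float` element type are assumptions, since the question does not show the calling code. The third launch parameter is the dynamic shared memory size in bytes; it is what backs an `extern __shared__` array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the real kernel from the question is not shown.
__global__ void process_kernel()
{
    // Dynamic shared memory, sized by the third launch parameter.
    extern __shared__ float s_test[];

    int index = threadIdx.x + threadIdx.y * blockDim.x; // local index in block
    if (index < 64 && blockIdx.x == 0) {
        s_test[index] = (float)index;
    }
    __syncthreads(); // reached by every thread of the block
}

int main()
{
    dim3 grid(101, 101, 1);
    dim3 block(16, 16, 1);      // 256 threads per block
    size_t sharedBytes = 2048;  // dynamic shared memory per block

    process_kernel<<<grid, block, sharedBytes>>>();

    // A bad launch configuration is reported here ...
    cudaError_t launchErr = cudaGetLastError();
    // ... while errors during execution surface at the next synchronization.
    cudaError_t runErr = cudaDeviceSynchronize();

    printf("launch: %s\n", cudaGetErrorString(launchErr));
    printf("run:    %s\n", cudaGetErrorString(runErr));
    return (launchErr == cudaSuccess && runErr == cudaSuccess) ? 0 : 1;
}
```

With `float` elements, the 64 values written here occupy 256 bytes, well inside the 2048 bytes requested.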

Can you provide device info, nsight version, driver version, launch dimensions, and dynamic shared memory size? If you provide a full reproducible including compiler switches then the team can look into the issue. — Greg Smith, Oct 02 '12 at 21:17
I've added all the information. Thank you very much for your help. — Iam, Oct 03 '12 at 10:35
@Iam, I am unable to repro this issue. Would you mind getting into contact with me? You may email me at devtools-support at nvidia dot com. Thanks! — Jeff Davis, Oct 12 '12 at 03:14
I'm voting to close this question as off-topic because the problem couldn't be reproduced and seems to have turned into an off Stack Overflow bug report with a vendor — talonmies, Jan 01 '16 at 14:30

score 1 · Answer 1 · edited May 05 '23 at 15:29

Have you tried putting __syncthreads() after assigning the values?

template<typename T>
__device__
T integrate()
{
   extern __shared__ T s_test[]; // Dynamically allocated shared memory
   int index = threadIdx.x + threadIdx.y * blockDim.x; // Local index in block. Column major ordering
   if(index < 64 && blockIdx.x==0) { // Only work on a few values. Just testing
      s_test[index] = (T)index;
      /* Some other irelevant code here */
   }
   __syncthreads();
   /**** Breakpoint (1) here ****/
   return v;
}

Then inspect the values at this breakpoint.
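
One way to cross-check what the Watch window reports is to copy the shared values to global memory after the barrier and print them on the host. The following is a minimal sketch, not the asker's code: `dump_kernel`, the `out` buffer, and the `float` instantiation are assumptions, `v` is initialized to zero because the original does not show its declaration, and the shared array is declared as raw bytes and cast to `T`, a commonly used pattern for templated dynamic shared memory that is swapped in here for the original declaration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

template<typename T>
__device__ T integrate(T *out) // `out` is a hypothetical debug buffer
{
    // Raw byte buffer viewed as T (swapped in for `extern __shared__ T s_test[]`).
    extern __shared__ unsigned char s_raw[];
    T *s_test = reinterpret_cast<T *>(s_raw);

    T v = 0; // assumption: the original does not show how v is declared
    int index = threadIdx.x + threadIdx.y * blockDim.x;
    // Restricted to block (0,0) so that each output slot has a single writer.
    bool active = (index < 64 && blockIdx.x == 0 && blockIdx.y == 0);

    if (active) {
        s_test[index] = (T)index;
    }
    __syncthreads(); // outside the branch, so every thread reaches it

    if (active) {
        // Read a slot written by a different thread to exercise the barrier.
        T tmp = s_test[63 - index];
        out[index] = tmp;
        v = tmp + index;
    }
    return v;
}

__global__ void dump_kernel(float *out)
{
    integrate<float>(out);
}

int main()
{
    const int n = 64;
    float h_out[n];
    float *d_out = 0;
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_out, 0, n * sizeof(float));

    dump_kernel<<<dim3(101, 101, 1), dim3(16, 16, 1), 2048>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != (float)(63 - i)) ++mismatches;
        printf("out[%d] = %g\n", i, h_out[i]);
    }
    printf("mismatches: %d\n", mismatches);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

If the host printout shows all 64 values while the Watch window does not, the discrepancy lies in what the debugger displays rather than in the kernel itself.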
