
cuda-gdb was honoring all the breakpoints I set before I added the '-arch sm_20' flag while compiling. I had to add this flag to avoid the error 'atomicAdd is undefined' (as pointed out here). Here is my current command to compile the code:

nvcc -g -G --maxrregcount=32 Main.cu -o SW_exe (..including header files...) -arch sm_20 
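For illustration, a kernel along these lines needs the flag (this is a simplified sketch with made-up names, not my actual code), because the float overload of atomicAdd() exists only for compute capability >= 2.0:

__global__ void sumKernel(float *partialSums, const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&partialSums[blockIdx.x], data[i]);  // float overload: needs sm_20
}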

When I set a breakpoint inside the kernel, cuda-gdb stops once at the last line of the kernel, and then the program continues.

(cuda-gdb) b SW_kernel_1.cu:49
Breakpoint 1 at 0x4114a0: file ./SW_kernel_1.cu, line 49.
...
[Launch of CUDA Kernel 5 (diagonalComputation<<<(1024,1,1),(128,1,1)>>>) on Device 0]

Breakpoint 1, diagonalComputation (__cuda_0=15386, __cuda_1=128, __cuda_2=0xf00400000, __cuda_3=0xf00200000, 
__cuda_4=100, __cuda_5=0xf03fa0000, __cuda_6=0xf04004000, __cuda_7=0xf040a0000, __cuda_8=0xf00200200, 
__cuda_9=15258, __cuda_10=5, __cuda_11=-3, __cuda_12=8, __cuda_13=1) at ./SW_kernel_1.cu:183
183     }
(cuda-gdb) c
Continuing.

But as I said, if I remove the atomicAdd() call and the '-arch sm_20' flag (which makes my code incorrect), cuda-gdb now stops at the breakpoint I specify. Please tell me the reason for this behaviour.
I am using CUDA 5.5 on Tesla M2070 (Compute Capability = 2.0).
Thanks!

Chirag Jain
  • Are you checking all your CUDA calls and the kernel launch for errors? http://stackoverflow.com/q/14038589/442006 – Roger Dahl Feb 12 '14 at 13:26
  • Yes, there is no error being reported. Is there any other way to compile the code with the atomicAdd() call without including the '-arch sm_20' flag? Because that way, cuda-gdb would work fine. – Chirag Jain Feb 13 '14 at 04:51
  • Try running your program with cuda-memcheck. `atomicAdd()` for 32-bit int has been available since compute capability 1.1, so you can compile for that architecture if you're using ints. `atomicAdd()` for 32-bit float is available only on CC >= 2.0. – Roger Dahl Feb 13 '14 at 05:18
  • Breakpoints are not necessarily hit in kernel functions, since the CUDA compiler can perform some code optimizations, so the disassembled code may not correspond to the CUDA source instructions. The optimizations can differ when changing compute capability. Try taking a look at the disassembled code: `cuobjdump xxx.cubin --dump-sass`. – Vitality Feb 13 '14 at 06:40
  • @JackOLantern, do you know if that can happen also in debug builds? – Roger Dahl Feb 13 '14 at 16:35
  • @RogerDahl I can answer with your own answer to a question of mine, see [NVIDIA Visual Profiler, Debug and Release modes in Visual Studio 2010](http://stackoverflow.com/questions/14245892/nvidia-visual-profiler-debug-and-release-modes-in-visual-studio-2010). [Here](http://stackoverflow.com/questions/18333124/cuda-debugging-with-vs-cant-examine-restrict-pointers-operation-is-not-v), Robert Crovella suggests using `printf`'s of the variables to avoid compiler optimizations, even with the `-G` debug switch. – Vitality Feb 13 '14 at 18:22
  • @JackOLantern, heh, that's funny :) I guess I hadn't thought of the optimizations also breaking debugging. Seems it's time that NVIDIA fixes this. – Roger Dahl Feb 13 '14 at 18:42
  • @RogerDahl Indeed :) Anyway, to quit self-referencing, the CUDA DEBUGGER User Manual, Section 3.3.1, states: _The `-g -G` option pair must be passed to NVCC when an application is compiled in order to debug with CUDA-GDB;_ [...] _Using this line to compile the CUDA application forces `-O0` compilation, with the exception of very limited dead-code eliminations and register-spilling optimizations._ So I would conclude that, in principle, there is no one-to-one mapping between the CUDA and disassembled codes, and it may be that the debugger skips the breakpoints. – Vitality Feb 13 '14 at 19:00
  • @JackOLantern, thanks, good find. So the "aggressive optimization" in debug mode that I mentioned in that answer a year ago is not correct. Hopefully it was correct at the time :) – Roger Dahl Feb 13 '14 at 21:10
  • Thanks guys, atomicAdd for integers in shared memory works for CC >= 1.2, and compiling the code for sm_12 worked for me :) (see the sketch after this comment thread). I had failed to make sense of the disassembled code, so I experimented by compiling for a different architecture. – Chirag Jain Feb 14 '14 at 15:21
  • One effect of the `-G` switch is to inhibit most device code compiler optimizations, one intent being to make it easy to set a breakpoint at any valid line of device source code. I don't think "aggressive optimization" when `-G` is specified is a sensible characterization. However, as already indicated, there may still be some compiler effects remaining that make setting a particular breakpoint difficult. Since no actual reproducible example is given in this question, one can only speculate. If a short, complete example can be provided, then perhaps something more definitive can be said. – Robert Crovella Feb 14 '14 at 18:10
  • Would somebody like to provide an answer to this question? The OP has been given a few options (e.g. compile with different arch, printf, etc.), and seems to have chosen one of them. – Robert Crovella Feb 14 '14 at 18:11
  • @RobertCrovella I have provided an answer to this question. – Vitality Feb 14 '14 at 23:04
  • Higher register pressure on sm_2x and beyond, as compared to sm_1x, can be the reason for the behaviour I am seeing. [This](https://devtalk.nvidia.com/default/topic/498288/cuda-programming-and-performance/too-many-registers-issue-with-memory-writes-and-registers/) discussion talks about certain optimizations that were possible on sm_1x but are no longer possible on sm_2x and beyond. PS: my code runs 40% faster with sm_12 than with sm_20. – Chirag Jain Feb 18 '14 at 09:02
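A minimal sketch of the workaround settled on above (a hypothetical kernel with illustrative names, not the OP's code): the 32-bit int overload of atomicAdd() on shared memory is available from compute capability 1.2, so the code can be built with '-arch sm_12' and the breakpoints are then honored.

__global__ void countPositives(const int *in, int *out, int n)
{
    __shared__ int blockCount;                // per-block partial count
    if (threadIdx.x == 0)
        blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0)
        atomicAdd(&blockCount, 1);            // int overload: available from CC 1.2
    __syncthreads();

    if (threadIdx.x == 0)
        out[blockIdx.x] = blockCount;
}

compiled, for example, with:

nvcc -g -G -arch sm_12 kernel.cu -o kernel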

1 Answer


From the CUDA DEBUGGER User Manual, Section 3.3.1:

NVCC, the NVIDIA CUDA compiler driver, provides a mechanism for generating the debugging information necessary for CUDA-GDB to work properly. The -g -G option pair must be passed to NVCC when an application is compiled in order to debug with CUDA-GDB; for example,

nvcc -g -G foo.cu -o foo

Using this line to compile the CUDA application foo.cu

  1. forces -O0 compilation, with the exception of very limited dead-code eliminations and register-spilling optimizations.
  2. makes the compiler include debug information in the executable

This means that, in principle, breakpoints may not be hit in kernel functions even when the code is compiled in debug mode, since the CUDA compiler can still perform some code optimizations, and so the disassembled code may not correspond one-to-one to the CUDA source instructions.

When breakpoints are not hit, a workaround is to put a printf statement on the variable one wants to check immediately after it is set, as suggested by Robert Crovella at

CUDA debugging with VS - can't examine restrict pointers (Operation is not valid)
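For instance, a minimal sketch of this workaround (the kernel and variable names are illustrative; note that device-side printf itself requires CC >= 2.0):

#include <stdio.h>

__global__ void kernel(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float tmp = 2.0f * a[i];
        // Printing tmp forces the compiler to keep it live, so it can be
        // inspected at a breakpoint set on the line below
        printf("tmp = %f\n", tmp);
        b[i] = tmp;
    }
}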

Here, the OP has chosen a different workaround, i.e., compiling for a different architecture. Indeed, the optimizations the compiler performs can change from architecture to architecture.
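With the OP's original command line, that amounts to something like (header includes elided, as in the question):

nvcc -g -G --maxrregcount=32 -arch sm_12 Main.cu -o SW_exe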

Vitality