
I have a rather large and complex CUDA code that hangs quite reliably for large numbers of blocks/threads. I am trying to figure out exactly where the code hangs.

When I run the code in cuda-gdb, I can see which threads/blocks are hanging, but I can't see where, beyond the "virtual PC".

If I compile the code with "-G" to get the debug information, it runs a lot slower and refuses to hang, no matter how long I run it for.

Is there any way to map a "virtual PC" to a line of code in the source code, even approximately? Or is there a way to get the debugging information in without turning off all optimization?

I've tried using "-G3", but to no avail. This just produces warnings of the type "nvcc warning : Setting optimization level to 0 as optimized debugging is not supported". I am using CUDA compilation tools release 4.1.

Pedro
  • I guess it says it all: "optimized debugging is not supported". You can try to debug with `printf`, though – aland May 14 '12 at 21:32
  • I have 120 blocks running and would have to printf at every step they take. This slows the computation down worse than the debugging flag. That's why I'm looking for an alternative, especially with regards to mapping the "virtual PC" to a source-code line. – Pedro May 14 '12 at 21:46
  • @aland: I actually tried the `printf` statements, but it fails due to the fact that they are only flushed once the kernel returns. If the kernel hangs, then none of the `printf` statements in that call are actually emitted. – Pedro May 16 '12 at 09:37
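
For the record, here is a minimal sketch of the device-side printf approach discussed in the comments above, and of why it yields nothing when the kernel never returns. The kernel, launch configuration and the deliberate hang are made up purely for illustration:

```cuda
#include <cstdio>

// Illustrative only: a kernel that reports progress with device-side printf
// and then spins forever. The printf output buffer is only copied back and
// flushed once the kernel completes (or at a later synchronization point
// that never arrives here), so none of these messages appear if the kernel
// hangs.
__global__ void hangingKernel()
{
    printf("block %d, thread %d: entering loop\n",
           (int)blockIdx.x, (int)threadIdx.x);

    volatile int spin = 1;          // stops the compiler from removing the loop
    while (spin) { /* simulated hang */ }
}

int main()
{
    hangingKernel<<<120, 32>>>();   // device printf requires sm_20 or newer
    cudaDeviceSynchronize();        // blocks forever; the buffered output is lost
    return 0;
}
```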

1 Answer


Ok, I think I've figured it out on my own.

If cuobjdump is in the path, then in cuda-gdb the command x/i $pc will show you the assembler instruction at which the current thread is stopped. The problem is that if the source was not compiled with -G, you can't relate that assembler instruction to a line in your code.

To match the assembler to the kernel code, first make sure that you compiled your kernel with nvcc -keep [..] mykernel.cu. This should generate the files mykernel.sm_20.cubin (or whatever arch you chose) and mykernel.ptx.

To get the assembler of your entire kernel, run cuobjdump -sass mykernel.sm_20.cubin > output.ptx. In cuda-gdb, do x/20i $pc-80 to get a bit of context, and look for those lines in the file output.ptx. You can then try to match those lines to the PTX code in mykernel.ptx, which contains .loc statements that refer to the line in the source.
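
Putting the steps together, the round trip looks roughly like the sketch below. The toy kernel, the file names, the sm_20 target and the extra nvcc options alongside -keep are assumptions for illustration only; substitute your own kernel and architecture:

```cuda
// mykernel.cu -- a stand-in kernel used only to illustrate the workflow.
//
// 1. Build, keeping the intermediate files (this leaves mykernel.ptx and
//    mykernel.sm_20.cubin next to the source):
//        nvcc -arch=sm_20 -keep -c mykernel.cu
//
// 2. Disassemble the cubin so it can be compared with what cuda-gdb shows:
//        cuobjdump -sass mykernel.sm_20.cubin > output.ptx
//
// 3. In cuda-gdb, once a hung thread has been halted:
//        (cuda-gdb) x/20i $pc-80
//    Search output.ptx for the instructions printed, then read the
//    corresponding region of mykernel.ptx, whose .loc directives name the
//    source line.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}
```

The line-info entries in the PTX have the form `.loc <file-index> <line> <column>`, where the file index refers back to a `.file` directive near the top of mykernel.ptx.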

This approach requires a bit of creativity in matching the disassembled SASS from the cubin file to the PTX from nvcc, as the instructions may be re-ordered somewhat. In my code, I had large blocks of FFMA instructions I could look for to get my bearings. You can use output.ptx to find the exact instruction reported by the debugger and then look in mykernel.ptx at the same relative position.

This all involves quite a bit of work, but it does allow you to narrow down the location of the "virtual PC" in your original source.

talonmies
Pedro
  • Thanks for sharing this strategy. Do you know if optimized/release build debugging is now supported in CUDA? – masterxilo Feb 22 '17 at 19:41