This is a continuation of this post.

It seems that one special case has been solved by adding volatile, but now something else has broken. If I add anything between the two kernel calls, the program reverts to the old behavior: it freezes and then prints everything at once. Adding sleep(2); between set_flag and read_flag reproduces this. Also, when I put the same code into another program, the GPU locks up. What am I doing wrong now?

Thanks again.

einpoklum
jrk0414
  • Are you on Windows? What is your machine configuration? (OS, GPU, CUDA version, other GPUs if any, etc.) – Robert Crovella Nov 06 '13 at 21:39
  • I'm using Ubuntu 12.04 with a GeForce GTX 650. – jrk0414 Nov 06 '13 at 21:43
  • Add [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to your code, especially on the kernels. I believe it will tell you something. I suspect an interaction with X. Are you running X on the GTX650 (i.e. do you have a graphical display hosted by the GTX650)? – Robert Crovella Nov 06 '13 at 21:47
  • I suppose I am running X on the GTX650, as I have two monitors connected to it. This behavior doesn't show up in other cases, but consistently does while trying to achieve the behavior I've described. The other problem is that I can't get any errors back since the program freezes. – jrk0414 Nov 06 '13 at 21:51

1 Answer

The problem is an interaction between X and the display driver, together with the interaction between the standard output queue and the graphical display driver.

A few experiments you can try (with the sleep(2); added between the set_flag and read_flag kernels):

  1. Log into your machine over the network via ssh from another machine. I think your program will work. (X is not involved in the display in this case.)
  2. Comment out the line that prints "Starting...". I think your program will then work. (This avoids the display driver/print queue deadlock; see below.)
  3. Add a sleep(2); between the "Starting..." print line and the first kernel. I think your program will then work. (This allows the display driver to fully service the first printout before the first kernel is launched, so there is no CPU thread stall.)
  4. Stop X and run from a console. I think your program will work.
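
As a rough illustration of experiment 3, combined with the error checking suggested in the comments, the host code might be arranged as in the sketch below. The kernel bodies here are trivial placeholders and are not the code from the question (which is in the linked post), and the CUDA_CHECK macro name is my own; only the names set_flag and read_flag, the "Starting..." printout and the sleep(2); calls come from the discussion. Because these placeholder kernels finish immediately, the sketch shows where the sleep and the checks go rather than reproducing the hang.

```cuda
#include <cstdio>
#include <cstdlib>
#include <unistd.h>        // sleep()
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a readable message if a CUDA
// runtime call returns anything other than cudaSuccess.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Placeholder kernels (assumptions): stand-ins for the asker's kernels.
__global__ void set_flag(volatile int *flag) { *flag = 1; }

__global__ void read_flag(volatile int *flag, int *out) { *out = *flag; }

int main() {
    int *d_flag = nullptr;
    int *d_out = nullptr;
    int h_out = 0;

    CUDA_CHECK(cudaMalloc(&d_flag, sizeof(int)));
    CUDA_CHECK(cudaMalloc(&d_out, sizeof(int)));
    CUDA_CHECK(cudaMemset(d_flag, 0, sizeof(int)));

    printf("Starting...\n");
    sleep(2);  // experiment 3: give the display time to service the printout

    set_flag<<<1, 1>>>(d_flag);
    CUDA_CHECK(cudaGetLastError());       // catches launch errors

    sleep(2);  // the delay that was added between the two kernels

    read_flag<<<1, 1>>>(d_flag, d_out);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution

    CUDA_CHECK(cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost));
    printf("flag = %d\n", h_out);

    CUDA_CHECK(cudaFree(d_flag));
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

Checking cudaGetLastError() right after each launch and cudaDeviceSynchronize() afterwards separates launch failures from failures during kernel execution. If the program hangs before reaching a check, nothing is reported, which matches the situation described in the comments.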

When the GPU is both hosting an X display and also running CUDA tasks, it has to switch between the two. For the duration of the CUDA task, ordinary display processing is suspended. You can read more about this here.

The problem here is that when running X, the first printout is sent to the print queue but is not actually displayed before the first kernel is launched. This is evident because you don't see the printout before the display freezes. After that, the CPU thread stalls waiting for the text to be displayed, so the second kernel does not start. The intervening sleep(2); and its interaction with the OS are enough for this stall to occur. Meanwhile, the executing first kernel has the display driver "stopped" for ordinary display tasks, so the OS never gets past its stall and the second kernel is never launched, leading to the apparent hang.

Note that options 1, 2, or 3 in the linked custhelp article would be effective in your case. Option 4 would not.

Robert Crovella
  • Thank you for the advice. Can I get the same effect of bypassing X if I disconnect my monitors from the GPU in question, or does that still cause the conflict? – jrk0414 Nov 07 '13 at 16:42
  • Disconnecting the monitors won't help. However, removing the X server from the GTX650 GPU would help. This requires a modification to your xorg.conf file, and the details are beyond what I can cover in comments. And obviously it means the GTX650 won't be able to display anything. – Robert Crovella Nov 07 '13 at 17:06
  • Restating my comment. One of the things that I said was logging into your machine via SSH and running that way would work around this issue, at least based on my testing. So if disconnecting the monitors means you are logging in remotely, then yes, "disconnecting the monitors" would help. – Robert Crovella Nov 07 '13 at 18:29
  • I logged in remotely using SSH, but I wasn't able to get the desired operation. I could get the print to work (same as disconnecting the monitors) but the device doesn't actually finish. I print out all of the error codes, and they all return 0. I'm stuck again. – jrk0414 Nov 08 '13 at 20:13