I'm currently building a ray marcher to look at things like the Mandelbox, etc. It works great. However, in my current program each work-item handles one ray projected from the eye, which means there is a large amount of execution per work-item. So when I look at an incredibly complex object, or try to render at high enough precision, my display drivers crash because the kernel takes too long to execute on a single work-item. I'd like to avoid changing registry values to lengthen the timeout, since I want this application to work on multiple computers.
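
To make that concrete, here is a simplified sketch of the kind of kernel I mean (not my actual code; estimate_distance is just a placeholder standing in for the real Mandelbox distance estimator):

float estimate_distance(float3 p)
{
    return length(p) - 1.0f;   /* placeholder DE; the real one is the Mandelbox estimator */
}

__kernel void render(__global float4 *image, int width, int height,
                     float3 eye, float3 forward, float3 right, float3 up)
{
    /* One work-item per pixel: build a ray from the eye through the pixel,
       then march it until it hits or escapes, all in a single invocation. */
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width || y >= height) return;

    float u = (x + 0.5f) / width  * 2.0f - 1.0f;
    float v = (y + 0.5f) / height * 2.0f - 1.0f;
    float3 dir = normalize(forward + u * right + v * up);

    float t = 0.0f;
    float4 color = (float4)(0.0f);
    for (int i = 0; i < 100000; i++) {                       /* huge step budget = long-running kernel */
        float d = estimate_distance(eye + dir * t);
        if (d < 0.001f) { color = (float4)(1.0f); break; }   /* hit the surface */
        if (t > 100.0f) break;                               /* escaped the scene */
        t += d;                                              /* sphere-trace step */
    }
    image[y * width + x] = color;
}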

Is there any way to remedy this? As it stands, the executions of each work-item are completely independent of the work-items nearby. I've contemplated giving the GPU a buffer that would store each ray's current progress and only executing a small number of iterations per launch. Then I would just call the kernel over and over, and the result would hopefully refine a bit more each time. The problem with this is that I am unsure how to deal with branching rays (e.g. reflection and refraction) unless I have a maximum number of each to anticipate.
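
Roughly, the resumable version I'm picturing would look something like this (again just a sketch: RayState and the step budget are made up, it reuses the estimate_distance placeholder from above, and it only handles primary rays, which is exactly where the branching problem comes in):

#define STEPS_PER_LAUNCH 256

typedef struct {
    float3 origin;   /* eye position; host-side layout must match (float3 is 16 bytes) */
    float3 dir;      /* normalized ray direction       */
    float  t;        /* distance marched so far        */
    int    done;     /* 1 once the ray hit or escaped  */
} RayState;

__kernel void march_partial(__global RayState *rays, __global float4 *image)
{
    int gid = get_global_id(0);
    RayState r = rays[gid];
    if (r.done) return;

    /* Advance at most STEPS_PER_LAUNCH iterations, then persist progress,
       so no single launch runs long enough to trip the watchdog. The host
       re-enqueues this kernel until every ray reports done. */
    for (int i = 0; i < STEPS_PER_LAUNCH; i++) {
        float d = estimate_distance(r.origin + r.dir * r.t);
        if (d < 0.001f) { image[gid] = (float4)(1.0f); r.done = 1; break; }   /* hit    */
        if (r.t > 100.0f) { image[gid] = (float4)(0.0f); r.done = 1; break; } /* escaped */
        r.t += d;
    }
    rays[gid] = r;   /* save progress for the next launch */
}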

Does anyone have any pointers on what I should do to remedy this problem? I'm quite the greenhorn with OpenCL and have been having this issue for quite some time. I feel as though I'm doing something wrong, or fundamentally misusing OpenCL, since my single work-items have a lot of logic behind them, but I don't know how to split the task, as it is just a series of steps, checks, and adjustments.

Jonathan Pearl
  • Can you split it the other way around, by just running a small portion of the image each time? I.e. instead of 1000x1000, launch blocks of 100x100, 100 times. – DarkZeros Aug 20 '14 at 08:26
  • @DarkZeros Oh, is it the entire task's duration that determines whether my display driver stops liking my kernel, as opposed to any single work-item? I can try that and report how it goes; I'm not at home at the moment, sadly. – Jonathan Pearl Aug 20 '14 at 17:23
  • So, it took a bit of refactoring, but I've divided it into multiple kernel calls and it works without crashing. If you would like to create an answer out of your comment, @DarkZeros, I will gladly accept it. Now to figure out why NVidia is making the clEnqueueNDRangeKernel call block, but that is another question for another day. Thank you for the suggestion; I was misinformed about how the kernels were timed. – Jonathan Pearl Aug 21 '14 at 18:15

1 Answer

The crash you are experiencing is caused by nVIDIA's hardware watchdog timer. The OS may also detect the GPU as unresponsive and reset it (at least Windows 7 does).

You can avoid it in several ways:

  • Improve/optimize your kernel code so it takes less time
  • Buy faster hardware ($$$$)
  • Disable the watchdog timer (this is not an easy task, and not all devices have the option)
  • Reduce the amount of work queued to the device each time by launching multiple small kernels (note: there is a small overhead to doing it this way, introduced by each kernel launch)

The easiest and most straightforward solution is the last one, but if you can, try the first one as well.


As an example, a call like this (1000x1000 = 1M work-items global size):

size_t global[2] = { 1000, 1000 };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL /*offset = (0,0)*/, global, NULL, 0, NULL, NULL);

can be split up into many small calls ((100x100) x (10x10) = 1M). Since each launch's global size is now 100 times smaller, the watchdog should not be triggered:

for (int i = 0; i < 10; i++)
    for (int j = 0; j < 10; j++) {
        size_t offset[2] = { i * 100, j * 100 };
        size_t global[2] = { 100, 100 };
        clEnqueueNDRangeKernel(queue, kernel, 2, offset, global, NULL, 0, NULL, NULL);
    }
DarkZeros
  • Worked great. Instead of running a work-item for each ray of a 1920x1080 image, I called the kernel on batches of ~50,000 work-items, which from the original ~2M work-items has me calling the kernel 20 times instead of once. I am also considering the method I outlined earlier, which would let multiple smaller calls to the same work-item extend its previous work by a small amount. But at the moment this does stop the crashing, and the overhead of the 19 additional kernel calls is barely noticeable. Thank you, I had not realized the timer applied to the entire kernel. – Jonathan Pearl Aug 22 '14 at 15:39
  • Since the kernel runs using all the GPU resources, other apps and the OS can no longer present their content to the screen. After many lost frames, they consider the GPU dead and reset it. However, splitting the job allows other tasks to be performed between kernel calls. I wonder whether in OpenCL 1.2 (with device fission) a sub-device would also be affected by this behaviour, or whether other apps could run in parallel to the kernel execution (since the sub-device is only a portion of the real device). – DarkZeros Aug 22 '14 at 15:47