
I'm using PyOpenCL to do some complex calculations. It runs fine on the CPU, but I get an error when trying to run it on an NVIDIA GeForce 9400M (256 MB). I'm working on Mac OS X Lion (10.7.5).

The strange thing is that this error does not always show up. It seems to occur when my calculations use larger numbers (resulting in more iterations), but only when run on the GPU.

I'm not writing to memory locations I'm not supposed to write to. I ruled out possible problems with concurrent modification by running the code as a single work item.


I simplified my OpenCL code as much as possible, and from what was left created some very simple code with extremely weird behavior that causes the pyopencl.LogicError. It consists of two nested loops in which a couple of assignments are made to the result array. These assignments need not even depend on the state of the loop. This is run as a single thread (or work item, shape = (1,)) on the GPU.

__kernel void weirdError(__global unsigned int* result){
    unsigned int outer = (1<<30)-1;   // large bound: roughly a billion inner iterations
    for(int i=20; i--; ){
        unsigned int inner = 0;
        while(inner != outer){
            result[0] = 1248;         // two plain assignments; removing either one
            result[1] = 1337;         // makes the error disappear
            inner++;
        }
        outer++;
    }
}

The strange part is that removing either one of the assignments to the result array makes the error disappear, as does decreasing the initial value for outer (down to (1<<20)-1, for example). In these cases, the code returns normally, with the correct result available in the corresponding buffer. On the CPU, it never raises an error.


The OpenCL code is run from Python using PyOpenCL.

Nothing fancy in the setup:

import pyopencl as cl

platform = cl.get_platforms()[0]
device = platform.get_devices(cl.device_type.GPU)[0]
context = cl.Context([device])
program = cl.Program(context, getProgramCode()).build()
queue = cl.CommandQueue(context)

In this Python code I initialize result_buf to zeros, then run the OpenCL kernel that sets its values in a long iteration. Afterwards I try to copy the values back from device memory, but that's where it goes wrong:

import numpy

mem_flags = cl.mem_flags

result = numpy.zeros(2, numpy.uint32)
result_buf = cl.Buffer(context, mem_flags.READ_WRITE | mem_flags.COPY_HOST_PTR, hostbuf=result)

shape = (1,)
program.weirdError(queue, shape, None, result_buf)

cl.enqueue_copy(queue, result, result_buf)

The last line gives me:

pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue

  • How can this repeated assignment cause an error?

  • And more importantly: how can it be avoided?


I understand that this problem is probably platform dependent, and thus perhaps hard to reproduce. But this is the only machine I have access to, so the code should work on this machine.

DISCLAIMER: I have never worked with OpenCL (or CUDA) before. I wrote the code on a machine where the GPU did not support OpenCL. I always tested it on CPU. Now that I switched to GPU, I find it frustrating that errors do not occur consistently and I have no idea why.

neXus
    The `CL_INVALID_COMMAND_QUEUE` error usually indicates that a previously executing kernel caused some sort of exception. Typically this would be something like reading or writing to memory you shouldn't be (e.g. past the end of an array). Without seeing your kernel code, it's difficult for us to suggest why this might be happening. – jprice Jul 15 '14 at 15:13
  • That makes more sense, yes. In my case, would that be the `amf` kernel right before `cl.enqueue_copy(queue, result, result_buf)`, or a kernel even before that? – neXus Jul 15 '14 at 15:32
    If you are running multiple kernels, you can put a `queue.finish()` after each kernel invocation (see the sketch below these comments). The error you are seeing should move to the `finish()` call immediately following the kernel that is causing the problem. This should just be a temporary change to debug the problem - clearing the queue after each kernel call will give you a performance hit. – jprice Jul 15 '14 at 15:42
  • I used your advice to find out which kernel caused the error. It was very helpful. I then tried to debug that kernel but it took a lot longer than expected. Due to the strange behavior I decided to put everything in a single method and prune away everything unrelated to the error. I ended up with some very strange code. It's magic, the dark kind of magic that you never want to see. I think it was relevant enough to include it in the question. It might provide new insights. Any ideas? – neXus Jul 25 '14 at 00:20
  • Have you tried running this on a different system? It could just be a bug in Apple's OpenCL implementation (which is pretty flaky, in my experience). – jprice Jul 27 '14 at 16:02
  • This is currently the only laptop I have right now with a GPU that supports OpenCL. I initially wrote the code on a Windows machine, but there I could only run it on the CPU and it worked just fine. – neXus Jul 27 '14 at 16:06
  • After fiddling with the code, I encountered a different error, but as suggested [here](http://stackoverflow.com/questions/18410419/pyopencl-runtimeerror-clwaitforevents-failed-exec-status-error-for-events-in-w) turning it off and on again solved that, but sadly my original problem remained. – neXus Jul 27 '14 at 16:09
  • @neXus: One general piece of advice: avoid long loops inside a kernel; a work item that takes too many milliseconds will get killed. Intel-architecture CPUs are designed to prefer speed over parallelism *(you have fewer than ~10 cores and SSE vectors are limited to 32 bytes, but they are fast)*. GPUs prefer parallelism over speed *(you have many units with very large SIMD vectors, up to 1024 bits)*. – user2284570 Jul 27 '14 at 23:22
  • I tried my code on a Windows machine with an AMD architecture. It runs a lot faster, and this toy example does not cause problems anymore. However, my real code does. I noticed indeed that Windows kills the GPU when it takes more than two seconds to complete. Python then hangs (waiting for my kernel to complete, which it never does?) instead of raising an error as it did on Mac (NVIDIA). It should be possible to prevent Windows from killing and recovering the GPU (as mentioned [here](http://stackoverflow.com/questions/12259044/)), which might solve the problem. But I wonder: does Mac do something similar? – neXus Jul 28 '14 at 13:26
  • I could refactor to use only a single loop, but then I still won't know why this strange behavior occurred. And it really intrigues me... The idea was actually to exploit the parallel behavior that was rather obviously present in the algorithm. An existing implementation of this algorithm takes (in the extreme case) 40 hours on CPU. Using OpenCL I was able to cut the preprocessing down from half an hour to less than a minute. And I suspect the main program might run about 240 times faster, which would make it run in a mere 10 minutes. But even one hour would be acceptable. – neXus Jul 28 '14 at 13:38
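A minimal sketch of the debugging approach jprice describes in the comments above; the `preprocess` kernel name here is made up, substitute whichever kernels the real program launches:

# Flush the queue after every launch so the failure surfaces right after
# the kernel that caused it; remove the finish() calls once it is found.
program.preprocess(queue, shape, None, result_buf)   # hypothetical first kernel
queue.finish()   # an error raised here points at 'preprocess'

program.weirdError(queue, shape, None, result_buf)
queue.finish()   # an error raised here points at 'weirdError'

cl.enqueue_copy(queue, result, result_buf)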

2 Answers


My advice is to avoid such long loops inside a kernel. The work item is making over a billion iterations, and that's a long shot. The driver probably kills your kernel because it takes too long to execute. Reduce the number of iterations to the largest value that doesn't lead to the error and look at the execution time: if it is on the order of seconds, that's too much.

As you said, reducing the number of iterations solves the problem, and in my opinion that's the evidence. Reducing the number of assignment operations also makes the kernel run faster, as memory operations are usually the slowest.

The CPU doesn't face such difficulties for obvious reasons.
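
One way to act on this advice is to split the work over several short kernel launches instead of a single huge one. A rough sketch, assuming a hypothetical variant of the kernel (`weirdErrorChunk`) that takes the number of iterations to run as an extra uint argument:

# Enqueue the work as many small launches so each one finishes well within
# the driver's watchdog limit. CHUNK is a tuning knob, not a magic value.
CHUNK = 1 << 20
TOTAL = (1 << 30) - 1

done = 0
while done < TOTAL:
    n = min(CHUNK, TOTAL - done)
    program.weirdErrorChunk(queue, shape, None, result_buf, numpy.uint32(n))
    queue.finish()          # each launch returns quickly
    done += n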

Roman Arzumanyan

This timeout problem can be fixed on Windows and Linux, but apparently not on Mac.


Windows

This answer to a similar question (explaining the symptoms in Windows) tells both what is going on and how to fix it:

This is a known "feature" under Windows (not sure about Linux) - if the video driver stops responding, the OS will reset it. Except that, since OpenCL (and CUDA) is implemented by the driver, a kernel that takes too long will look like a frozen driver. There is a watchdog timer that keeps track of this (5 seconds, I believe).

Your options are:

  1. You need to make sure that your kernels are not too time-consuming (best).
  2. You can turn off the watchdog timer: Timeout Detection and Recovery of GPUs.
  3. You can run the kernel on a GPU that is not hooked up to a display.

I suggest you go with 1.

This answer explains how to actually do (2) in Windows 7. But the MSDN page for these registry keys mentions that they should not be manipulated by any application outside targeted testing or debugging. So it might not be the best option, but it is an option.
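
For completeness, a rough sketch of what option (2) boils down to, assuming Python's winreg module on Windows (administrator rights and a reboot are required, and per the MSDN warning this should only be done for debugging):

# Raise the TDR timeout by writing TdrDelay (in seconds) under the
# GraphicsDrivers key. Debugging aid only; revert it afterwards.
import winreg

key = winreg.CreateKeyEx(
    winreg.HKEY_LOCAL_MACHINE,
    r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers",
    0, winreg.KEY_SET_VALUE)
winreg.SetValueEx(key, "TdrDelay", 0, winreg.REG_DWORD, 30)
winreg.CloseKey(key)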


Linux

(From the CUDA Release Notes, but also applicable to OpenCL)

GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display.

While X does not need to be running in order to use CUDA, X must have been initialized at least once after booting in order to properly load the NVIDIA kernel module. The NVIDIA kernel module remains loaded even after X shuts down, allowing CUDA to continue to function.


Mac

Apple apparently does not allow fiddling with this watchdog, so the only option seems to be using a second GPU (one without a screen attached to it).

neXus