When using OpenCL on many older NVIDIA cards, calls to clEnqueueNDRangeKernel(...) do not return until the computation is complete. See: clEnqueueNDRange blocking on Nvidia hardware? (Also Multi-GPU).
The OpenCL standard implies that clEnqueueNDRangeKernel(...) should be asynchronous, and it is in fact non-blocking in the AMD and Intel implementations of OpenCL.
Has this been fixed on more modern NVIDIA GPUs?
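
For anyone who wants to check a given driver, one rough way to tell the two behaviours apart is to time the enqueue call separately from the wait that follows it. The sketch below is only an illustration and does not touch OpenCL itself: `classify_enqueue`, its `enqueue` and `wait` callables, and the 0.5 threshold are all hypothetical names and values I made up for this example. In a real test, `enqueue` would wrap the call to clEnqueueNDRangeKernel(...) and `wait` would wrap clFinish(...). Here both are simulated with sleeps so the script runs without a GPU.

```python
import time


def classify_enqueue(enqueue, wait, blocking_fraction=0.5):
    """Time an enqueue call and the wait that follows it.

    enqueue: callable that submits the work (hypothetical wrapper around
             clEnqueueNDRangeKernel).
    wait:    callable that returns once the work is done (hypothetical
             wrapper around clFinish).
    blocking_fraction: assumed threshold; if the enqueue call accounts for
             at least this share of the total time, report it as blocking.
    """
    t0 = time.perf_counter()
    enqueue()
    t1 = time.perf_counter()
    wait()
    t2 = time.perf_counter()

    enqueue_s = t1 - t0
    total_s = t2 - t0
    looks_blocking = total_s > 0 and enqueue_s / total_s >= blocking_fraction
    return enqueue_s, total_s, looks_blocking


# Simulated stand-in for a kernel (no GPU involved): a 50 ms sleep.
def _work():
    time.sleep(0.05)


# Driver that blocks inside enqueue: the time is spent in the first call.
blocking = classify_enqueue(_work, lambda: None)
# Driver that returns immediately: the time is spent in the wait instead.
non_blocking = classify_enqueue(lambda: None, _work)

print("blocking driver looks blocking:", blocking[2])
print("non-blocking driver looks blocking:", non_blocking[2])
```

With a real kernel, the launch needs to run long enough (tens of milliseconds at least) for the difference between the two timings to stand out from call overhead.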