
I have some work I want to do on a CUDA stream, say a kernel K, which depends on previous work that needs to be done on the CPU. The exact details of that CPU work are not known to me when I'm scheduling K; I just want K not to start until it is given an indication that everything is ready.

Now, if I knew exactly what CPU work is to be done, e.g. that K could start after some function foo() concludes, I could do the following (sketched in code below the list):

  • Enqueue a call to foo() on stream SideStream
  • Enqueue an event E1 on SideStream
  • Enqueue a wait on event E1 on MainStream
  • Enqueue K on MainStream
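In CUDA Runtime API terms, that might look roughly like the following sketch, using cudaLaunchHostFunc() to enqueue the host-side call (foo_wrapper, SideStream, MainStream and K are placeholder names):

#include <cuda_runtime.h>

__global__ void K() { /* the dependent kernel */ }

void foo() { /* the CPU work */ }

// cudaLaunchHostFunc() takes a void(*)(void*), so wrap foo():
void foo_wrapper(void*) { foo(); }

int main() {
    cudaStream_t SideStream, MainStream;
    cudaStreamCreate(&SideStream);
    cudaStreamCreate(&MainStream);

    cudaEvent_t E1;
    cudaEventCreate(&E1);

    cudaLaunchHostFunc(SideStream, foo_wrapper, nullptr); // enqueue foo() on SideStream
    cudaEventRecord(E1, SideStream);                      // enqueue event E1 on SideStream
    cudaStreamWaitEvent(MainStream, E1, 0);               // MainStream waits for E1
    K<<<1, 1, 0, MainStream>>>();                         // enqueue K on MainStream

    cudaDeviceSynchronize();
}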

but - what if my CUDA scheduling code doesn't have access to such a foo()? I want to allow some other, arbitrary place in my code to fire E1 when it is good and ready, and have that trigger K on MainStream. ... but I can't do that, since in CUDA, you can only wait on an event that has already been enqueued (already "recorded").

This seems to be one of the few niches in which OpenCL offers a richer API than CUDA's: "user events", created with clCreateUserEvent(). They can be waited upon, and their execution completion status can be set by the user, with clSetUserEventStatus().

But surely CUDA is able to provide something like this itself, if only so that the OpenCL API call could be implemented on top of it. So, what is the idiomatic way to achieve this effect with CUDA?

einpoklum
  • You could have a "K-launcher thread" that waits on a condition variable – Abator Abetor May 31 '23 at 14:00
  • Is cudaWaitExternalSemaphoresAsync what you are looking for? – Abator Abetor May 31 '23 at 18:46
  • @AbatorAbetor: Hmm, possibly. I remember they introduced "external semaphore" support a while ago and I didn't pay attention to it back then. – einpoklum Jun 01 '23 at 07:43
  • @AbatorAbetor: Well, I checked, and apparently the "external semaphore" is only intended for Vulkan interoperability, not for arbitrary synchronization objects. Documentation and code suggest there's no support there for _arbitrary_ external semaphores. – einpoklum Jun 01 '23 at 20:36
  • @AbatorAbetor: See my new answer. – einpoklum Jun 01 '23 at 20:42

3 Answers


One could launch a kernel before K that simply waits until a flag is set from the host. On newer GPUs, cuda::latch may be more efficient, since it appears to use the nanosleep() function while spinning:

#include <cstdio>
#include <chrono>
#include <thread>
#include <new>        // for placement-new

#include <cuda/latch>

__global__ 
void kernel(){
    printf("kernel\n");
}

// Spin until the host sets *flag to 1; volatile forces a re-read on every iteration.
__global__ 
void waitKernel(volatile int* flag){
    while(*flag != 1);
}

// Block until the latch is released (its count reaches zero).
__global__ 
void waitKernelLatch(cuda::latch<cuda::thread_scope_system>* latchPtr){
    latchPtr->wait();
}

int main(){
    // Both the flag and the latch live in pinned host memory,
    // so they are visible to both the host and the device.
    int* waitFlag;
    cudaMallocHost(&waitFlag, sizeof(int));

    cuda::latch<cuda::thread_scope_system>* latchPtr;
    cudaMallocHost(&latchPtr, sizeof(cuda::latch<cuda::thread_scope_system>));

    printf("wait using flag\n");
    *waitFlag = 0;
    waitKernel<<<1,1>>>(waitFlag);  // blocks the stream until the flag is set
    kernel<<<1,1>>>();              // only runs after waitKernel completes

    printf("do some cpu stuff\n");
    std::this_thread::sleep_for(std::chrono::seconds(3));

    *waitFlag = 1;                  // release waitKernel
    cudaDeviceSynchronize();


    printf("wait using latch\n");
    new (latchPtr) cuda::latch<cuda::thread_scope_system>(1);  // expected count: 1
    waitKernelLatch<<<1,1>>>(latchPtr);
    kernel<<<1,1>>>();

    printf("do some cpu stuff\n");
    std::this_thread::sleep_for(std::chrono::seconds(3));

    latchPtr->count_down();         // release waitKernelLatch
    cudaDeviceSynchronize();

    cudaFreeHost(latchPtr);
    cudaFreeHost(waitFlag);
}
Abator Abetor
  • 1. That's wasteful. Why should I have the GPU do busy-waiting? 2. The way you wrote your example is fully synchronous, it will probably wait forever... – einpoklum Jun 01 '23 at 07:42
  • Well, that's one way to do it. It does not wait forever, you can test it yourself. – Abator Abetor Jun 01 '23 at 07:46
  • Yes, you're right, because the kernel launch is a _bit_ asynchronous even on the default stream. – einpoklum Jun 01 '23 at 07:51

Here's a possible idea - based on @AbatorAbetor's comment, although I have no idea if that's what people use in practice.

  • Write a function foo() which takes a condition variable as a parameter and waits on it. You can use std::condition_variable, for example.
  • Define a condition variable.

Now proceed as in your question, since you now have exactly the function you were missing:

  • Enqueue a call to foo() on stream SideStream
  • Enqueue an event E1 on SideStream
  • Enqueue a wait on event E1 on MainStream
  • Enqueue K on MainStream

but you are not quite done: your scheduler now passes the condition variable (while keeping it alive!) onwards/outwards, so that, finally, the "CPU work" you mentioned has a reference to it. When that work is done, all it needs to do is a notify operation on the condition variable: this will wake up foo(), which will then immediately trigger E1 and then K.

Caveat: I am assuming that letting a CUDA callback function block like this doesn't interfere with other CUDA runtime/driver work.
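A minimal sketch of this approach, using cudaLaunchHostFunc() to enqueue foo() (the Signal struct and the stream/event names are illustrative):

#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>

// State shared between the scheduler and the eventual "CPU work".
struct Signal {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
};

// The foo() we were missing: blocks until notified.
void foo(void* userData) {
    auto* s = static_cast<Signal*>(userData);
    std::unique_lock<std::mutex> lock(s->m);
    s->cv.wait(lock, [s]{ return s->ready; });
}

// Scheduling side:
//   cudaLaunchHostFunc(SideStream, foo, &signal);
//   cudaEventRecord(E1, SideStream);
//   cudaStreamWaitEvent(MainStream, E1, 0);
//   K<<<grid, block, 0, MainStream>>>();
//
// Wherever the CPU work finishes:
//   { std::lock_guard<std::mutex> lock(signal.m); signal.ready = true; }
//   signal.cv.notify_one();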

einpoklum
  • @AbatorAbetor: That's not possible. There is no "launch K", there is only scheduling. A host thread is not guaranteed the privilege of having the GPU wait and see what it does. Other work on MainStream, dependent on K, will already have started execution if K had not been scheduled before it on MainStream. – einpoklum May 31 '23 at 15:03
  • Thank you for clarification. I misunderstood the question – Abator Abetor May 31 '23 at 18:40

One can probably use the CUDA driver's "stream memory operations" functionality (cuStreamWaitValue32() and friends), avoiding host function scheduling altogether (sketched in code below the list):

  • Designate a 32-bit value v (preferably well-aligned) for signaling MainStream.
  • Register v's memory with cuMemHostRegister(), then obtain its device address via cuMemHostGetDevicePointer() (possibly the same as its host address).
  • Enqueue a wait on v (using cuStreamWaitValue32()) on MainStream.
  • Enqueue K.
  • Pass v to whatever code schedules the extra CPU work.
  • Make sure 1 is written to v when the CPU work is done.
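A minimal sketch of these steps, using the driver API (error checking omitted; the names are illustrative, and cuStreamWaitValue32() requires a device that supports stream memory operations):

#include <cuda.h>

alignas(4) volatile unsigned int v = 0;   // the 32-bit signal value

void schedule(CUstream MainStream) {
    // Pin v's page and map it into the device's address space:
    cuMemHostRegister((void*)&v, sizeof(v), CU_MEMHOSTREGISTER_DEVICEMAP);
    CUdeviceptr v_dev;
    cuMemHostGetDevicePointer(&v_dev, (void*)&v, 0);

    // MainStream will not proceed past this point until v == 1:
    cuStreamWaitValue32(MainStream, v_dev, 1, CU_STREAM_WAIT_VALUE_EQ);

    // ... enqueue K on MainStream here ...
}

// Later, from whatever code finishes the CPU work:
//   v = 1;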
einpoklum