
I am looking for a way to get rid of busy waiting in the host thread in the following code (do not copy that code, it only shows the idea of my problem; it has many basic bugs):

cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    while (true) {
        if (cudaStreamQuery(streams[sid]) == cudaSuccess) { // BUSY WAITING!!!!
            cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
            kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
            break;
        }
        sid = (sid + 1) % S_N;
    }
}

Is there a way to idle the host thread and somehow wait for some stream to finish, and then prepare and run another stream?

EDIT: I added while(true) into the code to emphasize the busy waiting. Now I execute all the streams, and check which of them has finished in order to run a new one. cudaStreamSynchronize waits for a particular stream to finish, but I want to wait for whichever stream finished its job first.

EDIT2: I got rid of busy-waiting in the following way:

cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}
for (int i = 0; i < S_N; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}

But it appears to be slightly slower than the version with busy-waiting on the host thread. I think that is because I now distribute the jobs statically across the streams, so when one stream finishes its work it stays idle until every stream has finished. The previous version dynamically distributed work to the first idle stream, so it was more efficient, but it busy-waited on the host thread.

kokosing
  • I don't think the code above does what you want it to do. It does **not** wait for stream 0 to finish before stream 1 starts. Instead, it makes sure that any previous launches in stream 0 finish before launching more jobs on stream 0 (which is unnecessary, because that's already how streams work). To make your code do what you're asking, you want cudaThreadSynchronize(), cudaStreamSynchronize(0), or cudaStreamSynchronize(streams[sid-1]). – jmilloy Feb 24 '11 at 18:34
  • Yes, you were right. I added while-true to run each of the streams in parallel, and now in this loop I am checking which of the streams has finished, to execute a new one. – kokosing Feb 25 '11 at 08:38
  • No. Your edits do not do what you say they do, and you do not understand how streams work. In the first example, there was **no** waiting - your cudaStreamQuery calls **always** returned true because you call cudaStreamQuery(x) *before* you launch anything on stream *x*. In the new example, you call kernels before you synchronize. It is slower because the synchronize has to actually wait for your memcpy/kernel to finish. – jmilloy Feb 25 '11 at 12:25
  • In the first example cudaStreamQuery does not always return true; it starts returning false after I have scheduled work for all of the streams and still have data left to schedule. In that case I check in the while loop when any of the streams is ready to take a new job - so I have busy/active waiting for a free stream. The second example, as you said, is slower, but it does not have busy-waiting, and the work is scheduled over the streams. – kokosing Feb 25 '11 at 12:54
  • Yes, you're right, eventually it starts to return true because you schedule a new job on an old stream. The point is, there is no need to do this! When you schedule two jobs on one stream, the second will wait for the first to finish before it starts. You do not have to manage it; the device does! You can launch them all right in a row. (see my overly detailed answer below) – jmilloy Feb 25 '11 at 13:04
  • Yes, I know I can do that, but what about when the execution time of a kernel depends on the given data? When I distribute jobs over the streams it may happen that many of them become idle, because one stream got all the very long-lasting jobs. – kokosing Feb 25 '11 at 13:15
  • Hmm, now I see... how big is DATA_SIZE? Any chance you can just put *every* kernel into its own stream? – jmilloy Feb 25 '11 at 13:22
  • Unfortunately there is no such chance; there is not enough space in device memory. Moreover, the solution with streams does not require allocating huge amounts of data on the device. – kokosing Feb 25 '11 at 14:09
  • I think it would help if you gave us more details. How big is DATA_SIZE (roughly)? How long does a kernel launch take? Apparently there is a wide range, so what is the range? Have you characterized the problem, and if so, how bad is it? That is, if you just distribute the jobs evenly over S_N streams, how much time is wasted that you expect to recover? – jmilloy Feb 25 '11 at 14:18
  • DATA_SIZE ~= 4GB, single kernel execution time ~= 2ms, S_N = 32. With busy waiting I got 5.43GB/s; with synchronization at the end of the kernel executions I got 5.29GB/s (kernel with memcpy). Device memory size for one stream = 16MB. The kernel implements a histogram algorithm. It is only test data right now; we expect much more data and algorithms that are not of O(n) complexity like histogram. So I expect the kernel execution time to vary with the input for different algorithms. – kokosing Feb 25 '11 at 14:39
  • Ah, sorry - the more interesting question is how many kernels you have to launch to process all of the data. No more than 4, right? What is DATA_STEP? – jmilloy Feb 25 '11 at 16:55

5 Answers


The real answer is to use cudaThreadSynchronize to wait for all previous launches to complete, cudaStreamSynchronize to wait for all launches in a certain stream to complete, and cudaEventSynchronize to wait for only a certain event on a certain stream to be recorded.

However, you need to understand how streams and synchronization work before you will be able to use them in your code.


What happens if you do not use streams at all? Consider the following code:

kernel <<< gridDim, blockDim >>> (d_data, DATA_STEP);
host_func1();
cudaThreadSynchronize();
host_func2();

The kernel is launched and the host moves on, executing host_func1 and the kernel concurrently. Then the host and the device are synchronized, i.e. the host waits for the kernel to finish before moving on to host_func2().

Now, what if you have two different kernels?

kernel1 <<<gridDim, blockDim >>> (d_data + d1, DATA_STEP);
kernel2 <<<gridDim, blockDim >>> (d_data + d2, DATA_STEP);

kernel1 is launched asynchronously! The host moves on, and kernel2 is launched before kernel1 finishes! However, kernel2 will not execute until after kernel1 finishes, because they have both been launched on stream 0 (the default stream). Consider the following alternative:

kernel1 <<<gridDim, blockDim>>> (d_data + d1, DATA_STEP);
cudaThreadSynchronize();
kernel2 <<<gridDim, blockDim>>> (d_data + d2, DATA_STEP);

There is absolutely no need to do this because the device already synchronizes kernels launched on the same stream.

So, I think that the functionality that you are looking for already exists... because a kernel always waits for previous launches in the same stream to finish before starting (even though the host passes by). That is, if you want to wait for any previous launch to finish, then simply don't use streams. This code will work fine:

for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, 0);
    kernel<<<gridDim, blockDim, smSize, 0>>>(d_data, DATA_STEP);
}

Now, on to streams. You can use streams to manage concurrent device execution.

Think of a stream as a queue. You can put different memcpy calls and kernel launches into different queues. Then, kernels in stream 1 and launches in stream 2 are asynchronous! They may be executed at the same time, or in any order. If you want to be sure that only one memcpy/kernel is being executed on the device at a time, then don't use streams. Similarly, if you want kernels to be executed in a specific order, then don't use streams.

That said, keep in mind that anything put into stream 1 is executed in order, so don't bother synchronizing. Synchronization is for synchronizing host and device calls, not two different device calls. So, if you want to execute several of your kernels at the same time because they use different device memory and have no effect on each other, then use streams. Something like...

cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}

int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}

No explicit device synchronization necessary.
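If the results eventually need to come back to the host, you can queue the device-to-host copies in the same streams and block once per stream at the end. A minimal sketch, assuming hypothetical per-stream result buffers d_result/h_result of RESULT_STEP bytes each (h_result should be page-locked via cudaMallocHost for the copy to be truly asynchronous):

// Hypothetical result buffers, one chunk per stream.
for (int i = 0; i < S_N; i++) {
    cudaMemcpyAsync(h_result + i * RESULT_STEP, d_result + i * RESULT_STEP,
                    RESULT_STEP, cudaMemcpyDeviceToHost, streams[i]);
}
for (int i = 0; i < S_N; i++) {
    cudaStreamSynchronize(streams[i]); // blocks the host until stream i drains
}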

jmilloy
  • Thank you, your answer is very helpful. But what I want to achieve is to run a kernel and copy memory for another kernel concurrently, so I think in that case I do need to use streams. And after the kernel executions I need to synchronize them with the host thread, because I want to copy the results to the host. – kokosing Feb 25 '11 at 14:02

My idea to solve that problem is to have one host thread per stream. Each host thread would invoke cudaStreamSynchronize to wait until its stream's commands are completed. Unfortunately, this is not possible in CUDA 3.2, since it allows only one host thread to deal with one CUDA context; that means one host thread per CUDA-enabled GPU.

Hopefully, in CUDA 4.0 it will be possible: CUDA 4.0 RC news

EDIT: I have tested this in CUDA 4.0 RC, using OpenMP. I created one host thread per CUDA stream, and it started to work.
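A minimal sketch of that approach with OpenMP. It assumes CUDA 4.0+, where several host threads may share one context, and a hypothetical array of per-stream device buffers d_data[sid] so concurrent chunks do not overwrite each other:

#include <omp.h>

#pragma omp parallel num_threads(S_N)
{
    int sid = omp_get_thread_num();       // one host thread per stream
    #pragma omp for schedule(dynamic)     // next chunk goes to the first idle thread
    for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
        cudaMemcpyAsync(d_data[sid], h_data + d, DATA_STEP,
                        cudaMemcpyHostToDevice, streams[sid]);
        kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data[sid], DATA_STEP);
        cudaStreamSynchronize(streams[sid]); // blocks only this thread, no busy-waiting
    }
}

The dynamic schedule also restores the behavior of the busy-waiting version: a thread asks for the next chunk only after its own stream has drained.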

kokosing

There is: cudaEventRecord(event, stream) and cudaEventSynchronize(event). The reference manual http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_Toolkit_Reference_Manual.pdf has all the details.
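For example, a minimal sketch reusing the variables from the question: record an event after each chunk, and put the host thread to sleep on that event before reusing the stream. (Note it still waits on the streams round-robin, not on whichever finishes first.)

cudaEvent_t done[S_N];
for (int i = 0; i < S_N; i++) {
    cudaEventCreateWithFlags(&done[i], cudaEventBlockingSync); // sleep, don't spin
}

int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaEventSynchronize(done[sid]); // returns immediately if nothing was recorded yet
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    cudaEventRecord(done[sid], streams[sid]); // marks the end of this chunk
    sid = (sid + 1) % S_N;
}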

Edit: BTW streams are handy for concurrent execution of kernels and memory transfers. Why do you want to serialize the execution by waiting on the current stream to finish?

LumpN

Instead of cudaStreamQuery, you want cudaStreamSynchronize

int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaStreamSynchronize(streams[sid]);
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}

(You can also use cudaThreadSynchronize to wait for launches across all streams, and events with cudaEventSynchronize for more advanced host/device synchronization.)

You can further control the type of waiting that occurs with these synchronization functions. Look at the reference manual for the cudaDeviceBlockingSync flag and others. The default is probably what you want, though.
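A sketch of how that flag is set; it has to happen before the CUDA context is created, i.e. before any other runtime call touches the device (in newer toolkits the flag is spelled cudaDeviceScheduleBlockingSync):

cudaSetDeviceFlags(cudaDeviceBlockingSync); // must precede context creation
cudaSetDevice(0);
// From here on, cudaStreamSynchronize and cudaThreadSynchronize put the
// host thread to sleep instead of spinning.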

jmilloy

You need to issue the data-chunk copies and the kernel launches on those chunks in separate for loops. That will be more efficient.

Like this:

size = N*sizeof(float)/nStreams;

for (i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, cudaMemcpyHostToDevice, stream[i]);
}

for (i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
}

In this way the memory copy doesn't have to wait for the kernel execution of the previous stream, and vice versa.

jwdmsd
  • The launches all occur so quickly that it doesn't make a difference. Without seeing the broader context of the code, it's impossible to tell what synchronization is necessary/best, if any. – jmilloy Feb 25 '11 at 05:56
  • I ran it on a device with compute capability 2.x; it supports concurrent data transfer, so your code does not make a difference. – kokosing Feb 25 '11 at 08:46