cuBLAS synchronization best practices

Question

I read two posts on Stack Overflow, namely "Will the cublas kernel functions automatically be synchronized with the host?" and "CUDA Dynamic Parallelizm; stream synchronization from device". Both recommend calling a synchronization API, e.g., cudaDeviceSynchronize(), after invoking cuBLAS functions. I'm not sure it makes sense to use such a general-purpose function.

Would it be better to do as follows? [Correct me if I'm wrong]:

cublasHandle_t cublas_handle;
cudaStream_t stream;
// Initialize the matrices
CUBLAS_CALL(
  cublasDgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, M, M, 
    M, &alpha, d_A, M, d_B, M, &beta, d_C, M));
// cublasDgemm is non-blocking!
cublasGetStream(cublas_handle, &stream);
cudaStreamSynchronize(stream);
// Now it is safe to copy the result (d_C) from the device
// to the host and use it

On the other hand, cudaDeviceSynchronize may be preferable when many streams/handles are used to perform parallel cuBLAS operations. What are the "best practices" for the synchronization of cuBLAS handles? Can cuBLAS handles be thought of as wrappers around streams, in the sense that they serve the same purpose from the point of view of synchronization?

Why do you not like cudaDeviceSynchronize? Also, in your example, you are not setting the stream before the cuBLAS call. Finally, why bring streams into play? With only one stream, will stream synchronization perform differently from device synchronization? — Vitality, Apr 10 '14 at 13:31
@JackOLantern I've read that `cudaDeviceSynchronize` in general slows down the execution, so I thought it's better to avoid it. Additionally, `cudaStreamSynchronize` tells the device what exactly to synchronize. Maybe it doesn't make any difference, I just wanted to know what the best practices are when parallelizing cuBLAS operations. — Pantelis Sopasakis, Apr 10 '14 at 13:41
@JackOLantern Also, maybe there exist other things that do other stuff and it doesn't make sense to wait for them. In this sense cudaStreamSynchronize(stream) should be a better option I guess. — Pantelis Sopasakis, Apr 10 '14 at 13:42
We just tried this with managed memory, and `cudaStreamSynchronize(stream)` was not enough. Only when we called `cudaDeviceSynchronize` were we able to get consistent results (no races after the Dgemm call). Maybe managed memory requires full synchronization? — alfC, May 02 '21 at 01:39

score 6 · Accepted Answer · edited May 23 '17 at 11:46

If you are using a single stream, it makes no difference whether you synchronize that one stream or call cudaDeviceSynchronize(). In terms of performance and effect, the two should be exactly the same. Note that when using events to time part of your code (e.g., a cuBLAS call), it's good practice to call cudaDeviceSynchronize() before reading the elapsed time, so that the measurement is meaningful. In my experience, it doesn't impose any significant overhead and, besides, it's safer to time your kernels with it.
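As a rough illustration of the timing advice, here is a minimal sketch that times one cublasDgemm call with CUDA events. The matrix size, the zero-filled inputs, and the warm-up call are assumptions made for the example, and return codes are ignored for brevity; real code should check them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int M = 512;                        // illustrative size (assumption)
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = sizeof(double) * M * M;

    double *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaMemset(d_A, 0, bytes);                // zero the inputs so no uninitialized data is read
    cudaMemset(d_B, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);                    // no cublasSetStream: work goes to the default stream

    // Warm-up call so that one-time library initialization is not timed.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
                &alpha, d_A, M, d_B, M, &beta, d_C, M);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                // 0 = default stream
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
                &alpha, d_A, M, d_B, M, &beta, d_C, M);
    cudaEventRecord(stop, 0);

    // Wait until all issued work, including the stop event, has completed.
    // cudaEventSynchronize(stop) is a narrower wait that would also work here.
    cudaDeviceSynchronize();

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("Dgemm took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```

Without a synchronization point before cudaEventElapsedTime, the stop event may not have completed yet and the call can fail with cudaErrorNotReady.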

If your application uses multiple streams, then it makes sense to synchronize only against the stream you want. I believe that this question will be helpful to you. Also, you can read the CUDA C Programming Guide, Section 3.2.5.5.
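The multi-stream case might look like the sketch below: two cuBLAS handles, each bound to its own stream with cublasSetStream, and a wait on only the stream whose result is needed. The names s1, s2, h1, h2 and the tiny matrix contents are illustrative assumptions, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int M = 4;
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = sizeof(double) * M * M;
    std::vector<double> h_A(M * M, 1.0), h_B(M * M, 2.0), h_C1(M * M, 0.0);

    double *d_A, *d_B, *d_C1, *d_C2;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C1, bytes);
    cudaMalloc((void**)&d_C2, bytes);
    cudaMemcpy(d_A, h_A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), bytes, cudaMemcpyHostToDevice);

    // One stream and one handle per independent piece of work.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cublasHandle_t h1, h2;
    cublasCreate(&h1);
    cublasCreate(&h2);
    cublasSetStream(h1, s1);
    cublasSetStream(h2, s2);

    // Two independent products; they may overlap on the device.
    cublasDgemm(h1, CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
                &alpha, d_A, M, d_B, M, &beta, d_C1, M);
    cublasDgemm(h2, CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
                &alpha, d_A, M, d_B, M, &beta, d_C2, M);

    // Queue the copy behind the first product in s1, then wait for s1 only.
    cudaMemcpyAsync(h_C1.data(), d_C1, bytes, cudaMemcpyDeviceToHost, s1);
    cudaStreamSynchronize(s1);
    printf("C1[0] = %f\n", h_C1[0]);          // each entry should be 2*M

    cudaStreamSynchronize(s2);                // finish the second product before cleanup
    cublasDestroy(h1);
    cublasDestroy(h2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C1);
    cudaFree(d_C2);
    return 0;
}
```

Here cudaStreamSynchronize(s1) does not wait for the work in s2, which is the difference from cudaDeviceSynchronize().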

score 3 · Answer 2 · answered Apr 10 '14 at 19:29

It's not clear in your example that you would need to use explicit synchronization at all or why you would need to use it.

CUDA operations issued to the same stream are serialized. If you launch a kernel or make a cuBLAS call, and then follow it with a cudaMemcpy operation (or cublasGetVector/Matrix, etc.), the copy operation is guaranteed not to start until all previous CUDA activity issued to the same stream is complete.

The best practice for general cases is not to use explicit synchronization at all. Place activities which must be serially dependent in the same stream. Place activities which have no dependency on each other in separate streams.

There are many CUDA codes, using cuBLAS and otherwise, that don't use explicit synchronization at all. Your example has no particular need of it. Note that in the first answer you linked, talonmies said:

you need to call a blocking API routine like a synchronous memory transfer or...

In your example, that is exactly what you would do. You would call a memory transfer, either issued to the same stream (e.g. cudaMemcpyAsync) or default blocking transfer (like cudaMemcpy) and it would work just fine. No need for an explicit sync.
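A minimal sketch of that pattern, assuming small constant-filled matrices and the default stream (no cublasSetStream call); error checking is left out to keep it short:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int M = 4;
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = sizeof(double) * M * M;
    std::vector<double> h_A(M * M, 1.0), h_B(M * M, 2.0), h_C(M * M, 0.0);

    double *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaMemcpy(d_A, h_A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);                    // handle issues work to the default stream

    // Asynchronous with respect to the host: returns before the product is done.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
                &alpha, d_A, M, d_B, M, &beta, d_C, M);

    // Blocking copy in the same stream: it starts only after the Dgemm has
    // finished and returns only once h_C holds the result.
    // No cudaStreamSynchronize or cudaDeviceSynchronize is needed in between.
    cudaMemcpy(h_C.data(), d_C, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %f\n", h_C[0]);            // all-ones times all-twos: each entry is 2*M

    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```

If the handle were bound to a non-default stream, the equivalent would be a cudaMemcpyAsync issued to that same stream, followed by a wait on that stream before the host reads h_C.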

You may wish to read the appropriate programming guide section.
