I read two posts on Stack Overflow, namely "Will the cublas kernel functions automatically be synchronized with the host?" and "CUDA Dynamic Parallelizm; stream synchronization from device". Both recommend calling a synchronization API, e.g., cudaDeviceSynchronize(), after invoking cuBLAS functions. I'm not sure it makes sense to use such a general-purpose function, since it waits for all work on the device rather than only for the cuBLAS call.
Would it be better to do as follows? [Correct me if I'm wrong]:
    cublasHandle_t cublas_handle;
    cudaStream_t stream;

    // Create the handle and initialize the matrices (omitted)

    // cublasDgemm is non-blocking!
    CUBLAS_CALL(cublasDgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
                            &alpha, d_A, M, d_B, M, &beta, d_C, M));

    // Wait only for the stream that the handle is bound to
    CUBLAS_CALL(cublasGetStream(cublas_handle, &stream));
    cudaStreamSynchronize(stream);

    // Now it is safe to copy the result (d_C) from the device
    // to the host and use it
On the other hand, cudaDeviceSynchronize might be preferable when many streams/handles are used to run cuBLAS operations in parallel. What are the "best practices" for synchronizing cuBLAS handles? Can cuBLAS handles be thought of as wrappers around streams, in the sense that they serve the same purpose from the point of view of synchronization?
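To make the multi-stream case concrete, here is a minimal sketch of what I have in mind, not a definitive implementation. It assumes a single handle that is re-bound to each stream with cublasSetStream before every cublasDgemm call. The CUDA_CHECK and CUBLAS_CHECK macros, kNumStreams, the matrix size, and the zero-filled inputs are placeholders I made up for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical error-checking helpers; they are not part of CUDA or cuBLAS.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

#define CUBLAS_CHECK(call)                                            \
    do {                                                              \
        cublasStatus_t st_ = (call);                                  \
        if (st_ != CUBLAS_STATUS_SUCCESS) {                           \
            fprintf(stderr, "cuBLAS error: %d\n", (int)st_);          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const int M = 256;            // assumed square matrix dimension
    const int kNumStreams = 2;    // assumed number of independent products
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = sizeof(double) * M * M;

    cublasHandle_t handle;
    CUBLAS_CHECK(cublasCreate(&handle));

    cudaStream_t streams[kNumStreams];
    double *d_A[kNumStreams], *d_B[kNumStreams], *d_C[kNumStreams];

    for (int i = 0; i < kNumStreams; ++i) {
        CUDA_CHECK(cudaStreamCreate(&streams[i]));
        CUDA_CHECK(cudaMalloc((void **)&d_A[i], bytes));
        CUDA_CHECK(cudaMalloc((void **)&d_B[i], bytes));
        CUDA_CHECK(cudaMalloc((void **)&d_C[i], bytes));
        // Zero-fill the inputs on the same stream so the GEMM reads defined data.
        CUDA_CHECK(cudaMemsetAsync(d_A[i], 0, bytes, streams[i]));
        CUDA_CHECK(cudaMemsetAsync(d_B[i], 0, bytes, streams[i]));
    }

    // Enqueue one GEMM per stream; each call may return before the work finishes.
    for (int i = 0; i < kNumStreams; ++i) {
        CUBLAS_CHECK(cublasSetStream(handle, streams[i]));
        CUBLAS_CHECK(cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
                                 &alpha, d_A[i], M, d_B[i], M,
                                 &beta, d_C[i], M));
    }

    // Option 1: wait for each stream individually.
    for (int i = 0; i < kNumStreams; ++i) {
        CUDA_CHECK(cudaStreamSynchronize(streams[i]));
    }
    // Option 2 (alternative to the loop above): wait for everything on the device.
    // CUDA_CHECK(cudaDeviceSynchronize());

    // Release all resources.
    for (int i = 0; i < kNumStreams; ++i) {
        CUDA_CHECK(cudaFree(d_A[i]));
        CUDA_CHECK(cudaFree(d_B[i]));
        CUDA_CHECK(cudaFree(d_C[i]));
        CUDA_CHECK(cudaStreamDestroy(streams[i]));
    }
    CUBLAS_CHECK(cublasDestroy(handle));
    return 0;
}
```

With only a handful of streams, the per-stream loop and the single cudaDeviceSynchronize call look equivalent to me here. The difference would matter only if other, unrelated work were in flight on the device at the same time.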