Just a general question about cuBLAS. In a single host thread, if there is no memory transfer from GPU to CPU (e.g. cublasGetVector), are cuBLAS kernel functions (e.g. cublasDgemm) automatically synchronized with the host?
cublasDgemm();
//cublasGetVector();
host_functions();
Furthermore, what about two adjacent kernel calls? Does the second one wait for the first to finish?
cublasDgemm();
cublasDgemm();
And what about a synchronous transfer that does not touch the global memory used by the previous kernel?
cublasDgemm(...gA...gB...gC);
cublasGetVector(...gD...D...);