I want to make two cuBLAS calls (e.g. cublasDgemm) really execute concurrently in two cudaStreams.

As we know, the cuBLAS API is asynchronous: level-3 routines like cublasDgemm don't block the host. That means the following code (in the default cudaStream) should run concurrently:
cublasDgemm();
cublasDgemm();
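As a minimal check of the asynchronous behavior (this is my own illustrative sketch, not the actual program; it assumes `handle`, `dim`, `d_A`, `d_B`, `d_C` are already set up), the host reaches the line after the two calls almost immediately, even though the GPU work has not finished:

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: confirm that cublasDgemm returns to the host before the GPU
// finishes. Assumes handle and device buffers are already allocated.
void check_async(cublasHandle_t handle, int dim,
                 const double *d_A, const double *d_B, double *d_C)
{
    double alpha = 1.0, beta = 1.0;

    // Both calls return to the host right away...
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim,
                &alpha, d_A, dim, d_B, dim, &beta, d_C, dim);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim,
                &alpha, d_A, dim, d_B, dim, &beta, d_C, dim);
    printf("host reached this line before the GEMMs finished\n");

    // ...but block here until the device is actually done.
    cudaDeviceSynchronize();
}
```

Asynchronous with respect to the host, however, does not by itself mean the two kernels overlap on the device, which is what the profiler is showing.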
BUT, when I profile the program with the NVIDIA Visual Profiler, it shows that they run in order.

Then I tried to bind them to different cudaStreams; the pseudocode is:
// Create a stream for every DGEMM operation
cudaStream_t *streams = (cudaStream_t *) malloc(batch_count * sizeof(cudaStream_t));
for (i = 0; i < batch_count; i++)
    cudaStreamCreate(&streams[i]);

// Set matrix coefficients
double alpha = 1.0;
double beta  = 1.0;

// Launch each DGEMM operation in its own CUDA stream
for (i = 0; i < batch_count; i++) {
    // Set the CUDA stream for the subsequent cuBLAS call
    cublasSetStream(handle, streams[i]);

    // DGEMM: C = alpha*A*B + beta*C
    cublasDgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,
                dim, dim, dim,
                &alpha,
                d_A[i], dim,
                d_B[i], dim,
                &beta,
                d_C[i], dim);
}
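For completeness, this is how I tear things down after the loop (elided from the pseudocode above; the synchronize-then-destroy order is my own choice):

```cuda
// Wait for every stream's DGEMM to finish, then release the streams.
for (i = 0; i < batch_count; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
free(streams);
```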
When batch_count = 5, the result shown by the NVIDIA Visual Profiler is:

Multi-cublasDgemm Routines Execution Result With Multi-Streams

The result shows that they still run in order. How can I make multiple cuBLAS calls really run concurrently in multiple cudaStreams, like this:

Multi-Kernels Execution Result With Multi-Streams, They Run Really Concurrently
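(The overlap in that second screenshot came from plain kernels, not cuBLAS. A toy version of what I launched there, with a dummy kernel of my own, looks like this; small launches like these do overlap in the profiler because each one leaves most of the GPU idle:)

```cuda
// Dummy kernel (my own, not from cuBLAS): burns time on a few threads
// so that several instances can share the GPU at once.
__global__ void spin(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; k++)
            x[i] = x[i] * 0.5 + 1.0;
}

// Launch one tiny kernel per stream; these show up overlapped
// in the NVIDIA Visual Profiler timeline.
for (i = 0; i < batch_count; i++)
    spin<<<1, 64, 0, streams[i]>>>(d_C[i], 64);
```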
Does anybody have any idea? Thanks.