
I want to make two cuBLAS API calls (e.g. cublasDgemm) really execute concurrently in two cudaStreams.

As we know, the cuBLAS API is asynchronous: level-3 routines like cublasDgemm don't block the host. So I expected the following calls (in the default cudaStream) to run concurrently:

cublasDgemm();
cublasDgemm();

BUT when I profile the program with NVIDIA Visual Profiler, it shows that they run sequentially.

Then I tried to bind them to different cudaStreams; the pseudocode is:

// Create a stream for every DGEMM operation
cudaStream_t *streams = (cudaStream_t *) malloc(batch_count*sizeof(cudaStream_t));
for(int i=0; i<batch_count; i++)
    cudaStreamCreate(&streams[i]);

// Set matrix coefficients
double alpha = 1.0;
double beta  = 1.0;

// Launch each DGEMM operation in own CUDA stream
for(int i=0; i<batch_count; i++){
    // Set CUDA stream
    cublasSetStream(handle, streams[i]);

    // DGEMM: C = alpha*A*B + beta*C
    cublasDgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,
                dim, dim, dim,
                &alpha,
                d_A[i], dim,
                d_B[i], dim,
                &beta,
                d_C[i], dim);
}
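
(For completeness, here is a rough sketch of the setup and cleanup I am assuming around the loop above; error checking is omitted and the names match the snippet, but the actual allocations in my real program may differ.)

// Before the loop: create the cuBLAS handle and the per-operation device matrices
cublasHandle_t handle;
cublasCreate(&handle);

double **d_A = (double **) malloc(batch_count*sizeof(double *));
double **d_B = (double **) malloc(batch_count*sizeof(double *));
double **d_C = (double **) malloc(batch_count*sizeof(double *));
for(int i=0; i<batch_count; i++){
    cudaMalloc((void **)&d_A[i], dim*dim*sizeof(double));
    cudaMalloc((void **)&d_B[i], dim*dim*sizeof(double));
    cudaMalloc((void **)&d_C[i], dim*dim*sizeof(double));
    // ... copy the input matrices into d_A[i] and d_B[i] with cudaMemcpy ...
}

// After the loop: wait for all DGEMMs, then release the streams and the handle
for(int i=0; i<batch_count; i++){
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
cublasDestroy(handle);
free(streams);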

When batch_count=5, the result shown by NVIDIA Visual Profiler is:

[Screenshot: multi-cublasDgemm routines execution result with multi-streams]

The result shows that they still run sequentially. How can I make multiple cuBLAS APIs run truly concurrently in multiple cudaStreams, like this:

[Screenshot: multi-kernel execution result with multi-streams, the kernels run truly concurrently]

Does anybody have any idea? Thanks.

  • A gemm call above a particular size will launch kernels with enough blocks to fill a GPU so that subsequent kernel launches have no room to run concurrently. Kernels that are small enough (in terms of blocks) will tend to execute so quickly that concurrency is again hard to witness. This question has a premise that one can make any 2 kernels run concurrently; this is simply not true. Witnessing kernel concurrency is actually pretty difficult and requires carefully crafted kernel launches, even apart from CUBLAS. – Robert Crovella Dec 30 '16 at 05:08
  • @Robert Crovella, thanks for your comment. – Yangsong Zhang Dec 30 '16 at 08:14
  • @Robert Crovella, thanks for your comment. BUT I doubt that "A gemm call above a particular size will launch kernels with enough blocks to fill a GPU so that subsequent kernel launches have no room to run concurrently," because when I try to execute gemm with matrices of different dimensions (small enough that the GPU has resources left to run another gemm), the result is the same. – Yangsong Zhang Dec 30 '16 at 08:21
  • When the dimension is small enough, the kernel execution time is short enough (e.g. a few microseconds) that it may be contained within the kernel launch latency, and so it is difficult or impossible to see concurrency. If you disagree, provide the profiler output for that case, as well as the number of blocks contained in the kernel launch (discoverable from the profiler). The large case you have actually shown is not going to run concurrently due to the block count in each kernel. – Robert Crovella Dec 30 '16 at 08:36
  • And, if you have tiny little m=n=k=16 matrix multiply operations that you want to run on the GPU efficiently, you might want to try the batched gemm functions available in CUBLAS, instead (a sketch follows after these comments). – Robert Crovella Dec 30 '16 at 08:58
  • @Robert Crovella, thanks for your immediate response. I am sorry; actually, you are right. When I execute gemm with dims (m=n=k=256), the profiler shows that the multiple gemm routines run concurrently; the dims (m=n=k=16) case must be caused by the kernel launch latency. Unfortunately, in most cases in my work the matrix dims are large enough to fill a GPU, so IS IT IMPOSSIBLE to run multiple gemm routines concurrently? Would the following approach work: multiple host threads control multiple gemm routines (that is, one host thread invokes just one gemm routine)? Thanks. – Yangsong Zhang Dec 30 '16 at 09:50
  • There is no point in trying to get additional concurrency in the large cases that "fill" the GPU. Once the GPU is "filled", then generally speaking, there is nothing to be gained by attempting to run more things concurrently (or exposing additional parallelism). The GPU in that scenario is already processing your instructions as fast as possible, and attempting to run additional gemm calls concurrently will not get your work done any faster. – Robert Crovella Dec 30 '16 at 15:46
  • You can answer your own question if you want. Since you've now found a particular test example (m=n=k=256) for your case that shows concurrent execution of gemm kernels, you could provide that as an answer to your question "How to make multi CUBLAS APIs (eg. cublasDgemm) really execute concurrently". If you explained also the large and small (m=n=k=16) cases as discussed above, I'm sure it would be useful for others. – Robert Crovella Dec 30 '16 at 15:59
  • @RobertCrovella OP probably wants to know whether it is possible to run big kernels concurrently despite the fact that hardware is saturated by any of those kernels. Maybe something similar to how it can be done on host: e.g. having N hardware cores, OS may schedule N threads for a quantum of time in round robin fashion, so all M threads do progress, even if M > N. This is surely slower than running ops one after another, but this may be necessary for reasons other than just speedup. I don't think CUDA implements this, but I may miss recent CUDA improvements. – Ivan Aksamentov - Drop Jan 01 '17 at 13:45
  • @Robert Crovella, thank you. I have posted an answer to this question. BTW, I am not sure about 'whether one cuBLAS API call has filled one GPU or not'; how can I judge that from the gridSize, blockSize, and registers/thread reported by NVIDIA Visual Profiler and the GPU's properties? – Yangsong Zhang Jan 03 '17 at 04:07
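
Following up on the batched-gemm suggestion above, here is a minimal sketch (untested; it assumes the same handle, dim, batch_count, alpha, beta and the host arrays of device pointers d_A, d_B, d_C from the pseudocode in the question) of replacing the many small per-stream DGEMMs with a single cublasDgemmBatched call:

// The batched API needs the arrays of device pointers to live in device memory
const double **d_Aarray;
const double **d_Barray;
double **d_Carray;
cudaMalloc((void **)&d_Aarray, batch_count*sizeof(double *));
cudaMalloc((void **)&d_Barray, batch_count*sizeof(double *));
cudaMalloc((void **)&d_Carray, batch_count*sizeof(double *));
cudaMemcpy(d_Aarray, d_A, batch_count*sizeof(double *), cudaMemcpyHostToDevice);
cudaMemcpy(d_Barray, d_B, batch_count*sizeof(double *), cudaMemcpyHostToDevice);
cudaMemcpy(d_Carray, d_C, batch_count*sizeof(double *), cudaMemcpyHostToDevice);

// One call performs all batch_count DGEMMs; cuBLAS schedules the work itself,
// which for tiny matrices is usually better than many separate kernel launches
cublasDgemmBatched(handle,
                   CUBLAS_OP_N, CUBLAS_OP_N,
                   dim, dim, dim,
                   &alpha,
                   d_Aarray, dim,
                   d_Barray, dim,
                   &beta,
                   d_Carray, dim,
                   batch_count);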

1 Answer


Firstly, thanks for @Robert Crovella's comments.

With @Robert Crovella's help and my own research, I found that we can run multiple cuBLAS APIs (e.g. cublasDgemm) concurrently in some special cases, but in most cases we cannot.

CASE 1: When I execute cublasDgemm with large dims (m=n=k=1024*8) on a K40, the profiler shows the following result: [Screenshot: cublasDgemm with dims of M=N=K=1024*8]

CASE 2: When I execute cublasDgemm with small dims (m=n=k=64) on a K40, the profiler shows the following result: [Screenshot: cublasDgemm with dims of M=N=K=64]

CASE 3: BUT when I execute cublasDgemm with dims (m=n=k=256) on a K40, the profiler shows the following result: [Screenshot: cublasDgemm with dims of M=N=K=256]

The results of CASE 1 and CASE 2 show that we cannot run cuBLAS APIs concurrently with either large or small dims. For CASE 1 the reason is that the GPU resources are used up, so there is no room left to run other routines; for CASE 2 it is the kernel launch latency that makes it difficult to see the concurrency. Only in CASE 3, with the medium dims (m=n=k=256), do the gemm kernels actually show concurrent execution.
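
To double-check the overlap without the Visual Profiler, the same pair of DGEMMs can also be timed with CUDA events, once launched into two streams and once into a single stream. This is only a sketch: the helper name is made up, and it assumes the handle, two streams created with cudaStreamCreate, and the device matrices d_A, d_B, d_C and dim from the question.

// Hypothetical helper: issue two DGEMMs into the given streams and return the
// elapsed wall-clock time in milliseconds. Pass s0 == s1 for the serial baseline.
float time_two_dgemms(cublasHandle_t handle, cudaStream_t s0, cudaStream_t s1,
                      int dim, const double *alpha, const double *beta,
                      double **d_A, double **d_B, double **d_C)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // The events are recorded in the legacy default stream, which synchronizes
    // with the blocking streams s0 and s1, so 'stop' is recorded only after
    // both DGEMMs have finished.
    cudaEventRecord(start, 0);

    cublasSetStream(handle, s0);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim,
                alpha, d_A[0], dim, d_B[0], dim, beta, d_C[0], dim);

    cublasSetStream(handle, s1);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim,
                alpha, d_A[1], dim, d_B[1], dim, beta, d_C[1], dim);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

If the two-stream time is clearly below the single-stream time, the kernels overlapped; for the m=n=k=256 case there should be a visible difference, while for CASE 1 and CASE 2 the two timings should be essentially the same.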

  • If you look closely, you will notice that kernels from both streams overlap in all cases - the second kernel always starts before the first kernel finishes. – tera Jan 03 '17 at 04:12
  • @tera Thanks. Yes, you are right. Can you explain this phenomenon? – Yangsong Zhang Jan 03 '17 at 06:52
  • This shows that concurrent execution works - the second kernel starts once resources are freed up by the first, with a small latency. It's just that you don't get *complete* overlap for the reasons stated in your answer. You could compare these profiler results with the case where you launch both Dgemms in the same stream to find the small overlap go away completely. – tera Jan 03 '17 at 09:37
  • It's the so-called "tail effect", see [here](http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf) slides 19-23. The tail effect occurs when a kernel is finishing and its threadblocks are "draining" from the GPU, leaving more and more "empty space". A subsequent kernel launch can begin to utilize this empty space if launched into a separate stream. – Robert Crovella Jan 03 '17 at 18:46
  • Thank you. Does the current CUDA API allow peer-to-peer memcpy operations (memcpy from one device to another) to run concurrently with other operations? According to the book <>, it is not supported. – Yangsong Zhang Jan 04 '17 at 12:51
  • From my test program, NVIDIA Visual Profiler shows that a peer-to-peer memcpy operation can run concurrently with kernel execution (K40, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5). – Yangsong Zhang Jan 04 '17 at 13:30
  • This seems completely unrelated to your posted question, but yes, in the proper circumstances, P2P device-to-device memcopy via `cudaMemcpyPeerAsync` can occur concurrently with other operations (e.g. kernels, other cudaMemcpy traffic). – Robert Crovella Jan 07 '17 at 23:56
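
For reference, a minimal sketch of such an overlap. The buffer names are illustrative (d_dst on device 0, d_src on device 1, nbytes is the transfer size), and it assumes the handle and matrices used earlier were created on device 0:

// Enable direct peer access from device 0 to device 1 if the hardware allows it
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);
cudaSetDevice(0);
if (canAccess)
    cudaDeviceEnablePeerAccess(1, 0);   // otherwise the copy is staged through the host

cudaStream_t copyStream, computeStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&computeStream);

// The device-to-device copy is issued into its own stream ...
cudaMemcpyPeerAsync(d_dst, 0, d_src, 1, nbytes, copyStream);

// ... so it can overlap with a kernel (here a DGEMM) running in another stream
cublasSetStream(handle, computeStream);
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim,
            &alpha, d_A[0], dim, d_B[0], dim, &beta, d_C[0], dim);

cudaStreamSynchronize(copyStream);
cudaStreamSynchronize(computeStream);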