What is the best strategy to overlap kernel execution and data transfers in a GTX Titan card?

Question

When I try to overlap data transfers and kernel execution It seems like the card is executing all memory transfers in-order, no matter what stream I use.

So, If I issue the following:

stream 1: MemcpyA_HtoD_1; Kernel_1; MemcpyA_DtoH_1
stream 2: MemcpyA_HtoD_2; Kernel_2; MemcpyA_DtoH_2

The MemcpyA_HtoD_2 will wait till the MemcpyA_DtoH_1 is completed. So no overlapping is achieved. No matter what config of streams I use, the Memcpy operations are always issued in order. So the only way for achieving overlapping involves buffering the outputs or delaying the output transfer till the next iteration.

I use CUDA 5.5, windows 7 x64 and a GTX Titan. All cpu memory is pinned and data_transfers are done using the async version.

See the following screens with the behavior:

issuing, host_to_device -> kernel -> device_to_host (normal behavior) and can not get overlap.

non overlapping

issuing host_to_device -> kernel (avoiding device_to_host after kernel) gets overlap ... because all memory copies are executed in-order, no matter what stream configuration I try.

overlapping

UPDATE

If someone is interested in reproducing this issue, I have coded a synthetic program that shows this undesired behavior. Its a complete VS2010 solution using CUDA 5.5

VS2010 Streams Not Working link

Could someone execute this on linux for testing overlapping?

#include "cuda_runtime.h"
#include "device_launch_parameters.h"


#include <stdio.h>
#define N 1024*1024

__global__ void someKernel(int *d_in, int *d_out) {
    for (int i = threadIdx.x; i < threadIdx.x + 1024; i++) {
        d_out[i] = d_in[i];
    }
}

int main () {
    int *h_bufferIn[100];
    int *h_bufferOut[100];
    int *d_bufferIn[100];
    int *d_bufferOut[100];

    //allocate some memory
    for (int i = 0; i < 100; i++) {
        cudaMallocHost(&h_bufferIn[i],N*sizeof(int));
        cudaMallocHost(&h_bufferOut[i],N*sizeof(int));
        cudaMalloc(&d_bufferIn[i], N*sizeof(int));
        cudaMalloc(&d_bufferOut[i], N*sizeof(int));
    }

    //create cuda streams
    cudaStream_t st[2];
    cudaStreamCreate(&st[0]);
    cudaStreamCreate(&st[1]);

    //trying to overlap computation and memcpys
    for (int i = 0; i < 100; i+=2) {
        cudaMemcpyAsync(d_bufferIn[i], h_bufferIn[i], N*sizeof(int), cudaMemcpyHostToDevice, st[i%2]);
        someKernel<<<1,256, 0, st[i%2]>>>(d_bufferIn[i], d_bufferOut[i]);
        cudaMemcpyAsync(h_bufferOut[i], d_bufferOut[i], N*sizeof(int), cudaMemcpyDeviceToHost, st[i%2]);
        cudaStreamQuery(0);

        cudaMemcpyAsync(d_bufferIn[i+1], h_bufferIn[i+1], N*sizeof(int), cudaMemcpyHostToDevice, st[(i+1)%2]);
        someKernel<<<1,256, 0, st[(i+1)%2]>>>(d_bufferIn[i+1], d_bufferOut[i+1]);
        cudaMemcpyAsync(h_bufferOut[i+1], d_bufferOut[i+1], N*sizeof(int), cudaMemcpyDeviceToHost, st[(i+1)%2]);
        cudaStreamQuery(0);
    }
    cudaDeviceSynchronize();
}

It seems like Titan does not implements "Hyper-Q"? this behavior is like previous fermi cards. Acording to this white-paper http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf — Dredok, Jul 10 '13 at 11:45
I have to test putting the memcpyA_DtoH before the memcpyA_HtoD (so uploading the result of previous iteration to the cpu). It "should" work for my concrete algorithm but I'm still puzzled about how are actually implemented the copy queues in GTX Titan — Dredok, Jul 10 '13 at 12:15
It does not work either. Im really disappointed for being unable to overlap computation and execution effectively :/ — Dredok, Jul 11 '13 at 14:07
@RobertCrovella should it work in Linux? its a driver issue? — Dredok, Jul 12 '13 at 09:09
It's just a suggestion. Study similar questions on Stack Overflow and draw your own conclusions. That's what I did. I do not know what kind of issue it is, but I have seen several examples where people said they had better luck on linux. When I tried running the cuda concurrent kernels sample, I had better luck with linux. I have read anecdotally that WDDM presents some challenges for scheduling concurrent activity on the GPU. — Robert Crovella, Jul 12 '13 at 13:08
BTW I have tested puting some commands usually used to flush streams like cudaStreamQuery but no luck so far and really weird outcomes ... Unfortunately i'm tied to windows systems. — Dredok, Jul 12 '13 at 13:24
I don't like the workaround of delaying the 2nd transfer till data is ready using streamcallbacks or stuff like these. NVidia advertises Hyper-Q works with streams in GTX Titan but it seems it doesn't — Dredok, Jul 23 '13 at 07:43
Just one more data point and not exactly the one you need, but the above code shows good overlap on Ubuntu 12 + Tesla K40. So your code is good. — Levi Barnes, Sep 28 '14 at 22:48
I've run it on RHEL 6 with K20x and cuda 6.5. Works fine for me, overlap is good. — Christian Sarofeen, Dec 23 '14 at 19:35

score 1 · Answer 1 · edited May 23 '17 at 10:27

TL;DR: The issue is caused by the WDDM TDR delay option in Nsight Monitor! When set to false, the issue appears. Instead, if you set the TDR delay value to a very high number, and the "enabled" option to true, the issue goes away.

Read below for other (older) steps followed until i came to the solution above, and some other possible causes.

I just recently were able to partially solve this problem! It is specific to windows and aero i think. Please try these steps and post your results to help others! I have tried it on GTX 650 and GT 640.

Before you do anything, consider using both onboard gpu(as display) and the discrete gpu (for computations), because there are verified issues with the nvidia driver for windows! When you use onboard gpu, said drivers don't get fully loaded, so many bugs are evaded. Also, system responsiveness is maintained while working!

Make sure your concurrency problem is not related to other issues like old drivers (including bios) etc.
Go to computer>properties
Select advanced system settings on the left side
Go to the Advanced tab
On Performance click settings
In the Visual Effects tab, select the "adjust for best performance" bullet.

This will disable aero and almost all visual effects. If this configuration works, you can try enabling one-by-one the boxes for visual effects until you find the precise one that causes problems!

Alternatively, you can:

Right click on desktop, select personalize
Select a theme from basic themes, that doesn't have aero.

This will also work as the above, but with more visual options enabled. For my two devices, this setting also works, so i kept it.

Please, when you try these solutions, come back here and post your findings!

For me, it solved the problem for most cases (a tiled dgemm i have made),but NOTE THAT i still can't run "simpleStreams" properly and achieve concurrency...

UPDATE: The problem is fully solved with a new windows installation!! The previous steps improved the behavior for some cases, but ONLY a fresh install solved ALL the problems!

I will try to find a less radical way of solving this problem, maybe restoring just the registry will be enough.

What is the best strategy to overlap kernel execution and data transfers in a GTX Titan card?

1 Answers1