
I keep seeing a strange 0.04 ms overhead when working with memory in CUDA on my old GeForce 8800GT. I need to transfer ~1-2 KB to the device's constant memory, work with that data on the device, and get only one float value back from it.

My typical GPU code looks like this:

//allocate all the needed memory: pinned, device global
for(int i = 0; i < 1000; i++)
{
    //Do some heavy cpu logic (~0.005 ms long)        
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    my_kernel<<<128, 128>>>(output);
    //several other calls of different kernels
    cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    // Do some logic with returned value 
}
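
For context, the allocation step referenced in the first comment might look roughly like this. This is only a sketch: the question does not show the declarations, so the 2 KB size, the FLOAT_T typedef and the buffer types are assumptions.

//Sketch of the allocation step (sizes and types are assumed, not shown in the question)
typedef float FLOAT_T;

__constant__ char const_dev_mem[2048];   //~2K of device constant memory

char*    pinned_host_mem = NULL;
FLOAT_T* output          = NULL;
FLOAT_T  host_output     = 0.0f;
size_t   mem_size        = 2048;

//pinned (page-locked) host buffer, needed for truly asynchronous copies
cudaHostAlloc((void**)&pinned_host_mem, mem_size, cudaHostAllocDefault);
//device global memory for the single float result
cudaMalloc((void**)&output, sizeof(FLOAT_T));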

To measure the speed of the GPU memory operations on their own, I used this code (all kernel calls commented out, a cudaDeviceSynchronize call added):

//allocate all the needed memory: pinned, device global
for(int i = 0; i < 1000; i++)
{
    //Do some heavy cpu logic (~0.001 ms long)        
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    cudaMemcpyAsync((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    // Do some logic with returned value 
}

I measured the execution time of the loop and got ~0.05 sec (so 0.05 ms per iteration). The strange thing is that when I add more memory work (additional cudaMemcpyToSymbolAsync and cudaMemcpyAsync calls) each extra call adds less than 0.01 ms. This matches the measurements described here: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html

He also measured about 0.01 ms per transfer of a 1 KB block to the GPU. So where does that 0.04 ms (0.05 - 0.01) overhead come from? Any ideas? Maybe I should try this code on a newer card?

It seems to me that after the cudaDeviceSynchronize call and the CPU code, my GeForce drops into some power-saving mode or something like that.

otter
  • 0.05 ms is the average per iteration. If I were you I'd measure each iteration individually to see whether it's a constant value. The first iteration usually carries an overhead. – pQB Oct 30 '12 at 08:10
  • How do you measure the elapsed time? Do you call a CUDA function before entering the loop to eliminate the device initialization cost, e.g. cudaFree(0)? – phoad Oct 30 '12 at 20:50
  • @pQB, yes, it's an average, but I think it gives the real picture of the time I get when executing the code from my external CPU code. When I change the number of iterations from 1000 to, for example, 2000, the total time doubles (as expected). The same happens when I set the number of iterations to 500 - the time halves. – otter Oct 30 '12 at 21:16
  • @phoad, I just call GetSystemTime before and after the loop. About a function call to initialize the device - could you be so kind as to give me some links where I can read about it? One of my main theories about what is going on here is that the device is put into some "sleep mode" or something like that, which may happen after heavy CPU code execution. I'm going to run some tests in which I'll try to eliminate the CPU execution. – otter Oct 30 '12 at 21:27
  • http://stackoverflow.com/questions/11704681/cuda-cutil-timer-confusion-on-elapsed-time Just check this link. It has information about using a better CUDA-provided timer and how to eliminate the initialization cost from the timings. – phoad Oct 30 '12 at 23:01
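
Regarding the timing suggestions in the comments above, a context warm-up plus CUDA event timing might look roughly like this. It is only a sketch and reuses the variable names from the question:

//Warm up the context first so initialization cost is not included in the measurement
cudaFree(0);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for(int i = 0; i < 1000; i++)
{
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    cudaMemcpyAsync((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);   //total loop time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);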

1 Answer


I recommend you increase the number of threads you are launching:

    //Use malloc() to allocate memory on the CPU.
    //Change mem_size to the total amount of memory to be transferred to the GPU.
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    dim3 dimBlock(128,2);
    dim3 dimGrid(64000,1);
    my_kernel<<<dimGrid, dimBlock>>>(output);
    //several other calls of different kernels
    //change size field to 1000*sizeof(FLOAT_T)
    cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    // Do some logic with returned value 

If the code crashes (because of too many threads or too much GPU memory), use loops, but keep them to a minimum.
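
In other words, the idea is to batch the work of many iterations into one launch and amortize the copy overhead. A hypothetical sketch of the output side of that batched variant follows; it assumes the kernel is adapted so that each logical work item writes its own result slot, and dev_outputs/host_outputs are made-up names:

    //Hypothetical batched variant: copy all results back in one transfer
    //instead of 1000 separate one-float copies
    FLOAT_T* dev_outputs = NULL;
    FLOAT_T  host_outputs[1000];
    cudaMalloc((void**)&dev_outputs, 1000 * sizeof(FLOAT_T));

    dim3 dimBlock(128, 2);
    dim3 dimGrid(64000, 1);
    my_kernel<<<dimGrid, dimBlock>>>(dev_outputs);   //each work item writes dev_outputs[item]

    cudaMemcpy((void*)host_outputs, dev_outputs, 1000 * sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    cudaFree(dev_outputs);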

Fr34K
  • Unfortunately, I can't do it the way you've described. With that loop (from 0 to 1000) I'm trying to emulate my real project's behaviour: it receives data from the internet that updates 1000-10000 times a second (stock prices). On every update I have to perform some heavy logic (about 10000-100000 iterations), which CUDA cards handle much better than any CPU. My problem for now is to minimize any overhead I incur while interacting with the GPU. – otter Oct 30 '12 at 21:41
  • It seems your transfers are so small that they are completely dominated by the basic latencies of the hardware and the driver stack. On my 64-bit Linux workstation (Xeon 5272 + C2050; PCIe gen2) the latency of a device-to-host or host-to-device transfer is about 25 us for any size between 1 byte and 1 KByte, which matches the 50 usec round trip you see. The idea behind asynchronous copies is to create multiple streams of copies and kernels so that copies can overlap kernels in a pipeline. One can often get (near) perfect overlap. This optimizes throughput but cannot improve round-trip latencies. – njuffa Oct 31 '12 at 00:30
  • @njuffa, many thanks for giving me the real numbers you measured. That brings me back down to earth and makes me think there is really no way to make this work faster. I was just trying to achieve [these results](http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html) - 10 us per transfer. And formally I got them: when I add any other async transfer, the measured time increases by that ~10 us. But if I leave only one transfer, the time is not 10 us but 50 us. These 40 us (50-10) are very critical for me. A lot of calculation in my kernels could be done during that time. – otter Nov 01 '12 at 10:36
  • I don't know what underlying mechanisms dictate the minimal turnaround time, but I observe empirically that various combinations of motherboards and GPUs have one-shot 1 KByte copy latencies anywhere from 14 usec to 25 usec with my test framework. This would appear to indicate that the latency is probably mostly due to hardware, and you may be able to reduce it somewhat by trying different HW combinations. – njuffa Nov 01 '12 at 18:13
  • Structure the logic inside the iterations so that more parallelism is utilized. Since you are using data from the web, you can mask the retrieval with the iterations. The best way to deal with this problem is to look for alternative iterative algorithms that are suited to parallel programming. I have used CUDA for power system analysis. There are two numerical methods, Gauss-Seidel and Newton-Raphson. Newton-Raphson has some parallelism in it (only a bit, though). It increased performance compared to Gauss-Seidel. – Fr34K Nov 05 '12 at 06:34
  • Following up on my suggestion to try different hardware combinations, the lowest host/device transfer times I have been able to find were on a 64-bit Linux system with a Xeon X5550 CPU and an M2090, running CUDA 5.0. I measured 10 usec for the transfer of 1 KB in either direction (host->device, device->host) using pinned memory on the host. – njuffa Nov 10 '12 at 02:29
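
Regarding the stream-pipelining idea in njuffa's comments, a rough sketch follows. It is hypothetical: the stream count, the dev_in/dev_out buffers and the kernel signature are assumptions, and the input is staged through device global memory instead of the single __constant__ symbol, since one constant buffer cannot be double-buffered across streams. It also assumes my_kernel is adapted to read its input from a global-memory pointer.

const int NUM_STREAMS = 2;
const int N_ITEMS     = 1000;

cudaStream_t streams[NUM_STREAMS];
for (int s = 0; s < NUM_STREAMS; s++)
    cudaStreamCreate(&streams[s]);

//dev_in holds one input slice per stream, dev_out one result slot per stream
char*    dev_in  = NULL;
FLOAT_T* dev_out = NULL;
FLOAT_T  host_outputs[N_ITEMS];   //for truly asynchronous copies this should also be pinned
cudaMalloc((void**)&dev_in,  NUM_STREAMS * mem_size);
cudaMalloc((void**)&dev_out, NUM_STREAMS * sizeof(FLOAT_T));

for (int i = 0; i < N_ITEMS; i++)
{
    int s = i % NUM_STREAMS;
    //operations issued to the same stream execute in order, so reusing slot s is safe
    cudaMemcpyAsync(dev_in + s * mem_size, pinned_host_mem, mem_size,
                    cudaMemcpyHostToDevice, streams[s]);
    my_kernel<<<128, 128, 0, streams[s]>>>(dev_in + s * mem_size, dev_out + s);
    cudaMemcpyAsync((void*)&host_outputs[i], dev_out + s, sizeof(FLOAT_T),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();

for (int s = 0; s < NUM_STREAMS; s++)
    cudaStreamDestroy(streams[s]);
cudaFree(dev_in);
cudaFree(dev_out);

As njuffa notes, this can improve overall throughput by overlapping copies with kernels, but it does not shorten the round trip of a single update.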