cudaMemcpy too slow

Question

I use cudaMemcpy() one time to copy exactly 1GB of data to the device. This takes 5.9s. The other way round it takes 5.1s. Is this normal?
Does the function itself have so much overhead before copying? Theoretical there should be a throughput of at least 4GB/s for the PCIe bus.
There are no memory transfers overlapping because the Tesla C870 just does not support it. Any hints?

EDIT 2: my test program + updated timings; I hope it is not too much to read!
The cutCreateTimer() functions wont compile for me: 'error: identifier "cutCreateTimer" is undefined' - this could be related to the old cuda version (2.0) installed on the machine

 __host__ void time_int(int print){
static struct timeval t1; /* var for previous time stamp */
static struct timeval t2; /* var of current time stamp */
double time;
if(gettimeofday(&t2, 0) == -1) return;
if(print != 0){
  time = (double) (t2.tv_sec - t1.tv_sec) + ((double) (t2.tv_usec - t1.tv_usec)) / 1000000.0;
  printf(...);
}
t1 = t2;
}

main:
time(0);
void *x;
cudaMallocHost(&x,1073741824);
void *y;
cudaMalloc(&y, 1073741824);
time(1);
cudaMemcpy(y,x,1073741824, cudaMemcpyHostToDevice);
time(1);
cudaMemcpy(x,y,1073741824, cudaMemcpyDeviceToHost);
time(1);

Displayed timings are:
0.86 s allocation
0.197 s first copy
5.02 s second copy
The weird thing is: Although it displays 0.197s for first copy it takes much longer if I watch the program run.

Can you add some information to your question about how you are doing the timing measurements? — talonmies, Sep 15 '11 at 11:30

osgx · Answer 1 · 2011-09-15T12:19:05.203

12

Yes, This is normal. cudaMemcpy() does a lot of checks and works (if host memory was allocated by usual malloc() or mmap()). It should check that every page of data is in memory, and move the pages (one-by-one) to the driver.

You can use cudaHostAlloc function or cudaMallocHost for allocating memory instead of malloc. It will allocate pinned memory which is always stored in RAM and can be accessed by GPU's DMA directly (faster cudaMemcpy()). Citing from first link:

Allocates count bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy().

Only limiting factor is that total amount of pinned memory in system is limited (not more than RAM size; it is better to use not more than RAM - 1Gb):

Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.

edited Sep 15 '11 at 12:19

answered Sep 15 '11 at 11:35

osgx

90,338
53
357
513

See also: [Why is CUDA pinned memory so fast?](http://stackoverflow.com/q/5736968/196561) – osgx Sep 15 '11 at 11:40
Ok the "upload" (to device) went down to 1.1s but the "download" (from device) is still at 5s. Is this normal too? – Callahan Sep 15 '11 at 12:14
Is the download to Pinned memory too? – osgx Sep 15 '11 at 12:15
Yes. In this test I download to the same memory I uploaded from. – Callahan Sep 15 '11 at 12:24
Callahan, can you post details of code? the lines with: memory allocation; uploading; starting the kernel (and the kernel source); downloading, timings. The right timing method for Cuda is to use Cuda's timers, check code samples, bundled with CUDA. Or here: https://www.cs.virginia.edu/~csadmin/wiki/index.php/CUDA_Support/Measuring_kernel_runtime – osgx Sep 15 '11 at 16:22
I can't agree those copy performance numbers are normal, unless the card is in a very slow slot. All the G80s I used were capable of hitting about 1.8Gb/s on large transfers in a 16 lane PCI-e slot, perhaps 2.5Gb/s to pinned memory. The reported performance is only about 10% of that, which is exceedingly low. – talonmies Sep 15 '11 at 19:13
Posted the relevant code. I didnt use a kernel so far. As described in the edit the suggested Cuda timings don't work for me. – Callahan Sep 16 '11 at 11:31

score 6 · Answer 2 · edited Sep 16 '11 at 04:34

Assuming the transfers are timed accurately, 1.1 seconds for a transfer of 1 GB from pinned memory seems slow. Are you sure the PCIe slot is configured to the correct width? For full performance, you'd want a x16 configuration. Some platforms provide two slots, one of which is configured as a x16, the other as a x4. So if you machine has two slots, you might want try moving the card into the other slot. Other systems have two slots, where you get x16 if only one slot is occupied, but you get two slots of x8 if both are occupied. The BIOS setup may help in figuring out how the PCIe slots are configured.

The Tesla C870 is rather old technology, but if I recall correctly transfer rates of around 2 GB/s from pinned memory should be possible with these parts, which used a 1st generation PCIe interface. Current Fermi-class GPUs use a PCIe gen 2 interface and can achieve 5+ GB/s for tranfers from pinned memory (for throughput measurements, 1 GB/s = 10^9 bytes/s).

Note that PCIe uses a packetized transport, and the packet overhead can be significant at the packet sizes supported by common chipsets, with newer chipsets typically supporting somewhat longer packets. One is unlikely to exceed 70% of the nominal per-direction maximum (4 GB/s for PCIe 1.0 x16, 8 GB/s for PCIe 2.0 x16), even for transfers from / to pinned host memory. Here is a white paper that explains the overhead issue and has a handy graph showing the utilization achievable with various packet sizes:

http://www.plxtech.com/files/pdf/technical/expresslane/Choosing_PCIe_Packet_Payload_Size.pdf

What can you say about slow Download (card->host), even to pinned memory? — osgx, Sep 15 '11 at 21:08
Unfortunatly I will not get neither to the machine nor to the bios because it is a remote one I am working on. — Callahan, Sep 16 '11 at 11:35

score 1 · Answer 3 · answered Sep 16 '11 at 12:47

Other than a system that just is not configured properly, the best explanation for dreadful PCIe bandwidth is a mismatch between IOH/socket and the PCIe slot that the GPU is plugged into.

Most multi-socket Intel i7-class (Nehalem, Westmere) motherboards have one I/O hub per socket. Since the system memory is directly connected to each CPU, DMA accesses that are "local" (fetching memory from the CPU connected to the same IOH as the GPU doing the DMA access) are much faster than nonlocal ones (fetching memory from the CPU connected to the other IOH, a transaction that has to be satisfied via the QPI interconnect that links the two CPUs).

IMPORTANT NOTE: unfortunately it is common for SBIOS's to configure systems for interleaving, which causes contiguous memory allocations to be interleaved between the sockets. This mitigates performance cliffs from local/nonlocal access for the CPUs (one way to think of it: it makes all memory accesses equally bad for both sockets), but wreaks havoc with GPU access to the data since it causes every other page on a 2-socket system to be nonlocal.

Nehalem and Westmere class systems don't seem to suffer from this problem if the system only has one IOH.

(By the way, Sandy Bridge class processors take another step down this path by integrating the PCI Express support into the CPU, so with Sandy Bridge, multi-socket machines automatically have multiple IOH's.)

You can investigate this hypothesis by either running your test using a tool that pins it to a socket (numactl on Linux, if it's available) or by using platform-dependent code to steer the allocations and threads to run on a specific socket. You can learn a lot without getting fancy - just call a function with global effects at the beginning of main() to force everything onto one socket or another, and see if that has a big impact on your PCIe transfer performance.

I executed a `cudaSetDevice(0);` at the beginning as a "global effect" function. Timings did not change. — Callahan, Sep 18 '11 at 17:24

cudaMemcpy too slow

3 Answers3