Why is it faster to transfer data from CPU to GPU rather than GPU to CPU?

Question

I've noticed that transferring data to recent high-end GPUs is faster than gathering it back to the CPU. Here are the results from a benchmarking function provided to me by MathWorks technical support, run on an older Nvidia K20 and a recent Nvidia P100 (PCIe):

Using a Tesla P100-PCIE-12GB GPU.
Achieved peak send speed of 11.042 GB/s
Achieved peak gather speed of 4.20609 GB/s

Using a Tesla K20m GPU.
Achieved peak send speed of 2.5269 GB/s
Achieved peak gather speed of 2.52399 GB/s

I've attached the benchmark function below for reference. What is the reason for the asymmetry on the P100? Is this system dependent, or is it the norm on recent high-end GPUs? Can the gather speed be increased?

gpu = gpuDevice();
fprintf('Using a %s GPU.\n', gpu.Name)
sizeOfDouble = 8; % Each double-precision number needs 8 bytes of storage
sizes = power(2, 14:28);

sendTimes = inf(size(sizes));
gatherTimes = inf(size(sizes));
for ii=1:numel(sizes)
    numElements = sizes(ii)/sizeOfDouble;
    hostData = randi([0 9], numElements, 1);
    gpuData = randi([0 9], numElements, 1, 'gpuArray');
    % Time sending to GPU
    sendFcn = @() gpuArray(hostData);
    sendTimes(ii) = gputimeit(sendFcn);
    % Time gathering back from GPU
    gatherFcn = @() gather(gpuData);
    gatherTimes(ii) = gputimeit(gatherFcn);
end
sendBandwidth = (sizes./sendTimes)/1e9;
[maxSendBandwidth,maxSendIdx] = max(sendBandwidth);
fprintf('Achieved peak send speed of %g GB/s\n',maxSendBandwidth)
gatherBandwidth = (sizes./gatherTimes)/1e9;
[maxGatherBandwidth,maxGatherIdx] = max(gatherBandwidth);
fprintf('Achieved peak gather speed of %g GB/s\n',maxGatherBandwidth)

Edit: we now know it is not system dependent (see comments). I still want to know the reason for the asymmetry and whether it can be changed.

To answer your second question, I can reproduce the results on a Quadro M5000 GPU, here I have a peak send speed of 10.0442 GB/s and peak gather speed of 3.66208 GB/s. So it does not seem like it is system dependent. — Nicky Mattsson, May 12 '18 at 07:17
Thanks! I also tried it on other systems with similar results. So now we can confirm it is not system dependent. — avgn, May 12 '18 at 18:33
Considering GPUs are primarily designed to produce graphics on screen, it makes sense to make upload speed a priority over download speed. — Cris Luengo, May 12 '18 at 21:00
I think for the sake of completeness, this benchmark should be also done in a different environment/language to rule out MATLAB-related quirks. — Dev-iL, May 23 '18 at 09:15
Is the peak representative of the speed? I would definitely trust more the average. — Ander Biguri, May 23 '18 at 11:07

score 4 · Accepted Answer · answered May 22 '18 at 17:52

This is a CW for anybody interested in posting benchmarks from their machine. Contributors are encouraged to leave their details in case some future question arises regarding their results.

System: Win10, 32GB DDR4-2400 MHz RAM, i7 6700K. MATLAB: R2018a.

Using a GeForce GTX 660 GPU.
Achieved peak send speed of 7.04747 GB/s
Achieved peak gather speed of 3.11048 GB/s

Warning: The measured time for F may be inaccurate because it is running too fast. Try measuring something that takes longer.

Contributor: Dev-iL

System: Win7, 32GB RAM, i7 4790K. MATLAB: R2018a.

Using a Quadro P6000 GPU.
Achieved peak send speed of 1.43346 GB/s
Achieved peak gather speed of 1.32355 GB/s

Contributor: Dev-iL

score 1 · Answer 2 · answered May 08 '19 at 13:43

I am not familiar with MATLAB's GPU toolboxes, but I suspect that the second transfer (the one that gets data back from the GPU) starts before the first has ended.

% Time sending to GPU
sendFcn = @() gpuArray(hostData);
sendTimes(ii) = gputimeit(sendFcn);
%
%No synchronization here
%
% Time gathering back from GPU
gatherFcn = @() gather(gpuData);
gatherTimes(ii) = gputimeit(gatherFcn);

A similar question, for a C program, was posted here:

copy from GPU to CPU is slower than copying CPU to GPU

In that case, there is no explicit synchronization between launching a kernel on the GPU and copying the results back. The function that copies the data back, cudaMemcpy() in C, therefore has to wait for the previously launched kernel to finish before it can transfer anything, which inflates the time measured for the transfer.

With the CUDA C API, it is possible to force the CPU to wait until the GPU has finished all previously launched work, with:

cudaDeviceSynchronize();

And only then start measuring the time to transfer data back.

Maybe MATLAB also has a synchronization primitive.
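
As far as I know, the closest MATLAB counterpart is wait(gpu), which blocks until the GPU device has finished its pending work. Below is a minimal sketch of timing the gather with an explicit sync on both sides. It assumes Parallel Computing Toolbox and a supported GPU. The array size and the names hostCopy, gatherSeconds and bytesMoved are my own choices for illustration and are not part of the original benchmark.

```matlab
% Sketch only: assumes Parallel Computing Toolbox and a supported GPU.
gpu = gpuDevice();
numElements = 2^24;                                  % arbitrary size: 128 MiB of doubles
gpuData = randi([0 9], numElements, 1, 'gpuArray');  % generated directly on the GPU

wait(gpu);                     % block until all queued GPU work has finished
timer = tic;
hostCopy = gather(gpuData);    % device-to-host copy being measured
wait(gpu);                     % make sure nothing is still in flight
gatherSeconds = toc(timer);

bytesMoved = numElements * 8;  % 8 bytes per double
fprintf('Gather: %g GB/s\n', bytesMoved / gatherSeconds / 1e9);
```

This is a single measurement, so expect some noise. Repeating it a few times and comparing the result against the gputimeit figures would show whether a missing sync is really what inflates the gather time.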

The same answer also recommends measuring time with CUDA events.

This post on optimizing data transfers (also in C, sorry) uses events to measure data transfer times:

https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/

In that post, the time for transferring data is the same in both directions.
