
I am learning Boost.Compute (boost::compute), the OpenCL wrapper library, and my copies are very slow.

If CPU-to-CPU copy speed is normalized to 1, how fast are GPU-to-CPU, GPU-to-GPU, and CPU-to-GPU copies?

I don't need precise numbers; a general idea would be a great help. For example: "CPU-to-CPU is at least 10 times faster than GPU-to-GPU."

Zeta
  • It depends entirely on your hardware setup and software techniques, but done right you should reach 40 to 90 percent of PCIe bandwidth, provided your transfers are large enough (and subject to plenty of other factors, including whether your GPU is in a 16-lane slot). If I recall correctly, I see roughly 5-6 GB/s. – Dithermaster Dec 05 '17 at 14:22
  • I came across this question when considering copying an image buffer via Pixelcopy https://stackoverflow.com/a/65932521/10183099 and OpenGL ES2. Basically, whether `ram_location1 to ram_location2 and ram_location2 to ram_location3` is faster than `ram_location1 to vram_location1 and vram_location1 to ram_location2`. However, I am still not sure. I am concerned with snapdragon 865. – zeitgeist Feb 26 '22 at 21:54

1 Answer


No one answered my question, so I wrote a program to measure the copy speeds.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
#include <boost/compute.hpp>

namespace compute = boost::compute;
using namespace std::chrono;
using namespace std;

int main()
{
    int sz = 10000000;                        // 10 million floats, about 40 MB per buffer
    std::vector<float> v1(sz, 2.3f), v2(sz);  // host (CPU) buffers
    compute::vector<float> v3(sz), v4(sz);    // device (GPU) buffers on the default device

    // Host to host. count() prints raw system_clock ticks,
    // whose length is implementation-defined.
    auto s = system_clock::now();
    std::copy(v1.begin(), v1.end(), v2.begin());
    auto e = system_clock::now();
    cout << "cpu2cpu cp " << (e - s).count() << endl;

    // Host to device.
    s = system_clock::now();
    compute::copy(v1.begin(), v1.end(), v3.begin());
    e = system_clock::now();
    cout << "cpu2gpu cp " << (e - s).count() << endl;

    // Device to device.
    s = system_clock::now();
    compute::copy(v3.begin(), v3.end(), v4.begin());
    e = system_clock::now();
    cout << "gpu2gpu cp " << (e - s).count() << endl;

    // Device to host.
    s = system_clock::now();
    compute::copy(v3.begin(), v3.end(), v1.begin());
    e = system_clock::now();
    cout << "gpu2cpu cp " << (e - s).count() << endl;
    return 0;
}

I expected the gpu2gpu copy to be fast. On the contrary, cpu2cpu was the fastest and gpu2gpu was very slow on my system (an Intel i3 with Intel(R) HD Graphics Skylake ULT GT2). Maybe parallel processing is one thing and copy speed is another.

cpu2cpu cp 7549776
cpu2gpu cp 18707268
gpu2gpu cp 65841100
gpu2cpu cp 65803119

I hope someone finds this test program useful.

Zeta
  • If I am not wrong, Intel HD Graphics Skylake has no video-RAM, but shares the memory with the CPU. It would be interesting to compare the result on different systems. – Pietro Mar 05 '20 at 18:29
  • stackoverflow needs to improve. – John Jiang Oct 14 '21 at 20:03
  • I know next to nothing about compute.hpp, but I believe there must be a big mistake here. Maybe you're not using the parallel architecture of the GPU, or are you actually sending everything through the CPU? No way a CPU copies faster than a GPU; that makes no sense. Greetings – Stefan Reich Jun 02 '22 at 09:58