Memory copy is taking more time on GPU compared to CPU

Question

I have source and destination pointers for an image that I want to copy. When I run the copy on the CPU, it takes 2 ms. I then ran the copy with OpenCL, creating the buffers with:

clCreateBuffer(context,CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE,size,src_ptr,errcode_ret)
clCreateBuffer(context,CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE,size,dst_ptr,errcode_ret)

and wrote a kernel with a global work size of (W, H), so each work-item copies one pixel. It takes about 20 ms.

Can someone please help me do an efficient memory copy in OpenCL when we have image pointers to global memory? What is the proper work-group size to use for this?

score 1 · Answer 1 · answered Mar 05 '20 at 20:34

Can you help clarify what you're trying to accomplish? Are you trying to compare the time it takes to memcpy a host buffer to the time it takes to copy a device buffer using a GPU kernel?

If so, try allocating the buffer without the CL_MEM_USE_HOST_PTR flag. From the first response here it seems like some implementations map that buffer to system memory instead of device memory, which could slow down the copy kernel.

GandhiGandhi

I read that with CL_MEM_USE_HOST_PTR, the GPU can access the image directly from global memory instead of copying it from global memory to the GPU cache and then copying it again to the destination pointer in global memory. That's why I used it. I also tried without the flag; not much change. – Tarun Annapareddy Mar 06 '20 at 08:15
@mogu has a good answer about why CL_MEM_USE_HOST_PTR might be a little bit slower... Looks like that wasn't it here though. – GandhiGandhi Mar 06 '20 at 14:25
Can you post how you are timing the copy? That might matter in OpenCL. Also, for what it's worth (not sure if you're using an Nvidia GPU), going by this forum post from Robert Crovella [https://devtalk.nvidia.com/default/topic/994177/recommended-coalesced-access-word-size-/], the optimal work-group size isn't clear. Have you experimented with kernels copying 2, 4 or 8 pixels each? – GandhiGandhi Mar 06 '20 at 14:36
(Last comment from me) [https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/clEnqueueCopyBuffer.html] I would try `clEnqueueCopyBuffer` if I was in your shoes and needed to copy a device buffer – GandhiGandhi Mar 06 '20 at 14:39

score 1 · Answer 2 · answered Mar 06 '20 at 09:27

how to efficiently do memory copy on open cl when we have image pointers to global memory

The efficient way is to use memcpy() on the host pointers. In other words, use the CPU.
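For reference, a minimal sketch of that CPU baseline in C. The names `copy_image`, `now_ms` and `time_copy_ms` are made up for illustration, and the strided path assumes the image rows may carry padding, which the question does not say. A tightly packed image collapses to a single memcpy().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Copy `height` rows of `row_bytes` bytes each. The strides are the byte
   distances between consecutive row starts and may include padding. */
void copy_image(uint8_t *dst, size_t dst_stride,
                const uint8_t *src, size_t src_stride,
                size_t row_bytes, size_t height)
{
    assert(dst_stride >= row_bytes && src_stride >= row_bytes);
    if (dst_stride == row_bytes && src_stride == row_bytes) {
        /* Tightly packed: the whole image is one contiguous block. */
        memcpy(dst, src, row_bytes * height);
        return;
    }
    for (size_t y = 0; y < height; ++y)
        memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
}

/* Wall-clock time in milliseconds, using C11 timespec_get. */
double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Average time per copy of a tightly packed image over `repeats` runs,
   after one untimed warm-up copy. */
double time_copy_ms(uint8_t *dst, const uint8_t *src,
                    size_t row_bytes, size_t height, int repeats)
{
    assert(repeats > 0);
    copy_image(dst, row_bytes, src, row_bytes, row_bytes, height);
    double t0 = now_ms();
    for (int i = 0; i < repeats; ++i)
        copy_image(dst, row_bytes, src, row_bytes, row_bytes, height);
    return (now_ms() - t0) / repeats;
}
```

An optimizing compiler may drop repeated identical copies, so it is worth reading the destination after timing and sanity-checking the result against the expected memory bandwidth.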

when we use CL_MEM_USE_HOST_PTR, GPU can access the image directly from global memory instead of copying from global memory

That's not strictly true. It's true for integrated GPUs (if the host_ptr memory pointer is properly aligned). Discrete GPUs will still copy host memory to their own memory over the PCI Express bus. If you read the documentation for clCreateBuffer, it says:

CL_MEM_USE_HOST_PTR ... OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
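On the "properly aligned" condition for integrated GPUs mentioned above: the exact requirement is vendor-specific and not stated in this thread. The sketch below assumes a 4096-byte pointer alignment and a size that is a multiple of 64 bytes, which is the guidance Intel gives for its processor graphics as far as I know; check your vendor's documentation before relying on these numbers. All names here are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HOST_PTR_ALIGN 4096u        /* assumed pointer alignment, vendor-specific */
#define HOST_PTR_SIZE_MULTIPLE 64u  /* assumed size granularity, vendor-specific */

/* Round n up to the next multiple of `multiple`. */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

/* Nonzero if (ptr, bytes) satisfies the assumed rules above. */
int meets_assumed_zero_copy_rules(const void *ptr, size_t bytes)
{
    return (uintptr_t)ptr % HOST_PTR_ALIGN == 0 &&
           bytes % HOST_PTR_SIZE_MULTIPLE == 0;
}

/* Allocate host memory suitable for passing as host_ptr under the assumed
   rules. The padded size is a multiple of the alignment, as C11
   aligned_alloc expects. Release the result with free(). */
void *alloc_host_image(size_t bytes, size_t *padded_bytes)
{
    assert(bytes > 0);
    size_t padded = round_up(bytes, HOST_PTR_ALIGN);
    if (padded_bytes)
        *padded_bytes = padded;
    return aligned_alloc(HOST_PTR_ALIGN, padded);
}
```

The padded size, not the original image size, is what would then be passed to clCreateBuffer together with CL_MEM_USE_HOST_PTR. Note that aligned_alloc is C11 and is not available with every toolchain.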

Discrete GPUs cannot directly "work" on host memory. Even if they could, it would be so slow as to be pointless.

In fact using CL_MEM_USE_HOST_PTR with a discrete GPU may result in worse performance, because the GPU will have to keep the host copy in sync with its own copy, which will result in a lot of PCIe transfers. CL_MEM_USE_HOST_PTR only makes sense with integrated GPUs to save unnecessary transfers and memory copies.
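A rough way to see why a single host-to-host copy loses on a discrete GPU is to count the bus traffic: the source has to cross PCIe once and the result has to cross it back. The sketch below is a deliberately simplified model, not a measurement. The function names are made up, the bandwidth and overhead figures are inputs you would have to measure on your own machine, and the on-device copy time is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Milliseconds needed to move `bytes` at a sustained rate of `gb_per_s`
   (1 GB taken as 1e9 bytes). */
double transfer_ms(size_t bytes, double gb_per_s)
{
    assert(gb_per_s > 0.0);
    return (double)bytes / (gb_per_s * 1e9) * 1e3;
}

/* Simplified cost of a host-to-host copy routed through a discrete GPU:
   one upload plus one download over PCIe, plus a fixed overhead for
   enqueueing and synchronization. */
double gpu_round_trip_ms(size_t bytes, double pcie_gb_per_s,
                         double overhead_ms)
{
    return 2.0 * transfer_ms(bytes, pcie_gb_per_s) + overhead_ms;
}
```

Comparing `gpu_round_trip_ms(size, pcie_rate, overhead)` with `transfer_ms(size, host_memcpy_rate)` for your image size gives a lower bound on the gap; under this model the GPU path can only win if the data stays on the device across many kernels.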

Generally, the way you work with GPUs is to minimize memory transfers: you create buffers once (with clCreateBuffer), launch the kernels you need on them, and then either transfer the result back to the host (via clEnqueueReadBuffer) or display it with OpenGL interop. You'll have to clarify what you're doing if you want more useful advice.
