how to efficiently do memory copy on open cl when we have image
pointers to global memory
The efficient way is to use memcpy()
on the host pointers. IOW use the CPU.
when we use CL_MEM_USE_HOST_PTR, GPU can access the image directly from global memory instead of copying from global memory
That's not strictly true. It's true for integrated GPUs (if the host_ptr memory pointer is properly aligned). Discrete GPUs will still copy host memory to their own memory over the PCI express bus. If you read the documentation for clCreateBuffer, it says:
CL_MEM_USE_HOST_PTR ... OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
Discrete GPUs cannot directly "work" on host memory. Even if they could, it would be so slow as to be pointless.
In fact using CL_MEM_USE_HOST_PTR with a discrete GPU may result in worse performance, because the GPU will have to keep the host copy in sync with its own copy, which will result in a lot of PCIe transfers. CL_MEM_USE_HOST_PTR only makes sense with integrated GPUs to save unnecessary transfers and memory copies.
Generally the way you work with GPUs is to minimize memory transfers, so you create buffers once (with clCreateBuffer), then launch the kernels you need on them, and then either transfer result back to host (via enqueueReadImage) or display it with OpenGL interop. You'll have to clarify what you're doing if you want more useful advice.