
First, I am having a hard time figuring out how clCreateBuffer() works when passed CL_MEM_ALLOC_HOST_PTR. Does it create a buffer on the device AND allocate memory for the host, or does it only allocate memory on the host and cache it on the device when it's being used?

My problem is this: if I have quite a few objects whose float* fields total more space than is available on my device, is there a better way than telling the runtime to copy (or use) the host pointer on the OpenCL device? Is it possible to have the runtime create the host pointer and use it for all the float* fields, even if together they need more memory than the device has? I wouldn't mind telling it to use the host pointer, but if I wanted to avoid memory copies when the runtime is on the CPU, I would have to align all the memory myself.

Also, are there any tips on good ways to handle using more memory on the host than is available on the device, so that memory transfers are as efficient as possible and involve the least copying?

Thanks.

contrapsych
  • I asked a similar (but more specific) question on [AMD's forums](http://devgurus.amd.com/message/1298052). Maybe you would be interested by some (partial) solutions that came up. – Simon Jun 17 '13 at 11:55

1 Answer


The standard only states that:

This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.

So, how it works under the hood is implementation dependent. NVIDIA states in section 3.3.1 of its OpenCL Programming Guide (V4.2) that:

objects using the CL_MEM_ALLOC_HOST_PTR flag (...) are likely to be allocated in page-locked memory by the driver for best performance.

In its own guide (here), AMD gives in section 4.5.2 a table showing the location of the memory objects for each flag value. The entire section 4.5 is dedicated to OpenCL memory objects; you might find it interesting.
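To make the flag's intent concrete, here is a minimal sketch of the usual zero-copy pattern with CL_MEM_ALLOC_HOST_PTR: let the runtime allocate the host-accessible (often pinned) memory, then map the buffer to fill it in place instead of wrapping your own float*. This assumes a valid `ctx`, `queue`, and element count `n` already exist, and error handling is omitted.

```c
cl_int err;
size_t nbytes = n * sizeof(float);

/* The runtime owns the allocation; no host pointer is passed in. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, nbytes, NULL, &err);

/* Map the buffer to get a host pointer and fill it in place.
 * On implementations that pinned the allocation, this avoids an extra copy. */
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, nbytes, 0, NULL, NULL, &err);
for (size_t i = 0; i < n; ++i)
    p[i] = 0.0f;

/* Unmap before using the buffer as a kernel argument. */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
```

Whether the mapped pointer really aliases device-visible pinned memory (rather than a staging copy) is, again, up to the implementation.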

Regarding your problem, if you don't have enough device memory, there is no solution (at least none that I can think of) other than splitting your data and processing it in several passes, as suggested here.

CaptainObvious
  • Assuming you have more than enough **host** memory to store all your data, regarding AMD's table it doesn't seem possible to allocate everything (on the host) and transfer it progressively to the device without any copy. Did I miss something? With NVidia this case is easily solved... – Simon Jun 13 '13 at 14:57
  • @Simon: Eventually, for both AMD and NVIDIA, data will be sent to the discrete GPU. The performance increase doesn't come from a progressive transfer, but from the fact that with the CL_MEM_ALLOC_HOST_PTR flag it's **likely** that the object will be allocated in pinned memory. With pinned memory the device can fetch data using DMA, thus decreasing CPU workload. See this [post](http://stackoverflow.com/questions/5736968/why-is-cuda-pinned-memory-so-fast). – CaptainObvious Jun 14 '13 at 08:42
  • Yes of course, that's not what I meant. But it seems that there is no way to tell OpenCL (at least on AMD) that a chunk of host memory is page-locked and non-cacheable and that the transfer can be done "safely" without any copy. On an NVidia platform, it's straightforward, as explained in the [best practices guide](http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf) (section 3.1.1). – Simon Jun 14 '13 at 11:10
  • Sorry, I misunderstood. However, even for NVIDIA it is not 100% certain that memory will be pinned when using the CL_MEM_ALLOC_HOST_PTR flag; the guide states that it is **likely**. Regarding AMD, section 4.5.1.1 of their programming guide explains in some detail how data is transferred to the devices depending on the size of the data. – CaptainObvious Jun 16 '13 at 17:29
  • "Likely" is typically one of those evil words I hate in documentation/specifications... Thanks for pointing that out to me. – Simon Jun 17 '13 at 11:53