
I have a piece of code in which I use clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag, and I realised that it allocates memory on the device. Is that correct, or am I missing something from the standard?

CL_MEM_ALLOC_HOST_PTR: This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.

Personally, I understood that this buffer should be a host-side buffer that can later be mapped using clEnqueueMapBuffer.
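To make the intent concrete, here is a minimal sketch of the pattern in question (hypothetical fragment: `ctx` and `queue` are assumed to be a valid cl_context and cl_command_queue created elsewhere):

```c
#include <CL/cl.h>

cl_int err;
size_t size = 1024 * sizeof(float);

/* The flag only asks for host-ACCESSIBLE memory; where the allocation
 * actually lands is up to the implementation. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

/* Map the buffer to obtain a host pointer and fill it from the CPU. */
float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                         0, size, 0, NULL, NULL, &err);
for (size_t i = 0; i < 1024; ++i)
    ptr[i] = (float)i;

/* Unmap before using the buffer in kernels. */
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
```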

Here is some info about the device I'm using:

Device: Tesla K40c
Hardware version: OpenCL 1.2 CUDA
Software version: 352.63
OpenCL C version: OpenCL C 1.2 
Nicola
    It should be a hint, not a request. Your device choose the best way it think to use which memory. – Tiger Hwang Feb 03 '17 at 08:33
  • So does "allocate memory from host accessible memory" mean: "allocates the buffer in a region of memory (on the host or the device side) that can be accessed from the host – e.g. via clEnqueueMapBuffer"? – Nicola Feb 03 '17 at 13:55
  • There was another discussion, http://stackoverflow.com/questions/25496656/cl-mem-use-host-ptr-vs-cl-mem-copy-host-ptr-vs-cl-mem-alloc-host-ptr – Tiger Hwang Feb 03 '17 at 14:23

2 Answers


It is described as

OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.

in

https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateBuffer.html

The description is for CL_MEM_USE_HOST_PTR, but it differs from CL_MEM_ALLOC_HOST_PTR only in who performs the allocation: USE takes a pointer the host application supplies, while ALLOC uses the return value of the OpenCL implementation's own allocator.

The caching is not possible on some integrated-GPU types, so it does not always happen.
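The contrast between the two flags can be sketched like this (hypothetical fragment: `ctx` is an existing cl_context and `size` a buffer size in bytes):

```c
#include <CL/cl.h>
#include <stdlib.h>

cl_int err;

/* CL_MEM_USE_HOST_PTR: the application supplies the host allocation.
 * The implementation may still cache its contents in device memory. */
void *host_mem = malloc(size);
cl_mem use_buf = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR,
                                size, host_mem, &err);

/* CL_MEM_ALLOC_HOST_PTR: the implementation's own allocator provides the
 * host-accessible memory; host_ptr must be NULL here. */
cl_mem alloc_buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                  size, NULL, &err);
```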

huseyin tugrul buyukisik
  • But as you said, only `CL_MEM_USE_HOST_PTR` is described like that. If that is also true for the `CL_MEM_ALLOC_HOST_PTR` flag, shouldn't it be written somewhere? (I'm just trying to be precise so as not to misunderstand.) – Nicola Feb 03 '17 at 14:03
  • Either OpenCL allocates it on the host, or it uses what the user already allocated on the host – both end up in host memory. Device memory is used for mapping + caching, but pure mapping should be possible with just the hardware caches (L1, L2), so device memory is not always used. The host copy acts as a shadow of device memory so the host can interact with the device; thanks to mapping or memory unification, device memory need not be allocated at all. – huseyin tugrul buyukisik Feb 03 '17 at 14:33
  • For example, multiple devices working on the same buffer but on different parts of it could cache their own regions in device memory, and when mapping they could work in parallel to gain speed. – huseyin tugrul buyukisik Feb 03 '17 at 14:50

The key phrase from the spec is host accessible:

This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.

It doesn't say it'll be allocated in host memory: it says it'll be accessible by the host.

This includes any memory that can be mapped into CPU-visible memory addresses. Typically some, if not all, VRAM in a discrete graphics device will be available through a PCI memory range exposed in one of the BARs; these ranges get mapped into the CPU's physical memory address space by firmware or the OS. They can then be used similarly to system memory in page tables and thus made available to user processes by mapping them to virtual memory addresses.

The spec even goes on to mention this possibility, at least in combination with another flag:

CL_MEM_COPY_HOST_PTR can be used with CL_MEM_ALLOC_HOST_PTR to initialize the contents of the cl_mem object allocated using host-accessible (e.g. PCIe) memory.
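That combination can be sketched as follows (hypothetical fragment: `ctx`, `src`, and `size` are assumed to exist; `src` points to `size` bytes of initialised host data):

```c
#include <CL/cl.h>

cl_int err;

/* Allocate host-accessible (possibly PCIe-mapped) memory and initialise
 * it from src in a single call. */
cl_mem buf = clCreateBuffer(ctx,
                            CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
                            size, src, &err);

/* src can be freed or reused immediately: its contents were copied into
 * the implementation-allocated memory at creation time. */
```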

If you definitely want to use system memory for a buffer (which may be a good choice if GPU access to it is sparse or less frequent than CPU access), allocate it yourself and wrap it in a buffer with CL_MEM_USE_HOST_PTR. (Which may still end up being cached in VRAM, depending on the implementation.)

pmdj