3

I have a vector on the host and I want to halve it and send it to the device. A benchmark shows that CL_MEM_ALLOC_HOST_PTR is faster than CL_MEM_USE_HOST_PTR and much faster than CL_MEM_COPY_HOST_PTR. Memory analysis on the device also shows no difference in the size of the buffer created on the device. This differs from the documentation of these flags on the Khronos clCreateBuffer page. Does anyone know what's going on?

Quonux
Damoon
  • hello, are you using a gpu, cpu, or apu device to run the kernel? also, which vendor? does your system have multiple processors (NUMA architecture?) – mfa Mar 05 '12 at 14:25
  • COPY_HOST_PTR does an implicit copy, but ALLOC_HOST_PTR requires an explicit copy. When you're running your benchmark for ALLOC_HOST_PTR, are you sure you're including the extra step to copy the buffer from host to device? If not, that might explain why it's so much faster. – vocaro Mar 05 '12 at 20:08
  • I'm using an NVIDIA GPU as the device. – Damoon Mar 06 '12 at 08:13

3 Answers

8

The answer by Pompei 2 is incorrect. The specification makes no guarantee as to where the memory is allocated, only how it is allocated. CL_MEM_ALLOC_HOST_PTR makes clCreateBuffer allocate the host-side memory for you; you can then map it to a host pointer using clEnqueueMapBuffer. CL_MEM_USE_HOST_PTR makes the runtime wrap the memory you give it in an OpenCL buffer.

Pinned memory is achieved through the use of CL_MEM_ALLOC_HOST_PTR: because the runtime performs the allocation itself, it is free to make that memory pinned.

All of this performance behaviour is implementation dependent. Reading section 3.1.1 of NVIDIA's guide more carefully will show that in one of the calls (with no CL_MEM host-pointer flag) NVIDIA is able to preallocate a device-side buffer, whilst the other calls merely get the pinned data mapped into a host pointer ready for writing to the device.

homemade-jam
2

First off and if I understand you correctly, clCreateSubBuffer is probably not what you want, as it creates a sub-buffer from an existing OpenCL buffer object. The documentation you linked also tells us that:

The CL_MEM_USE_HOST_PTR, CL_MEM_ALLOC_HOST_PTR and CL_MEM_COPY_HOST_PTR values cannot be specified in flags but are inherited from the corresponding memory access qualifiers associated with buffer.

You said you have a vector on the host and want to send half of it to the device. For this, I would use a regular buffer of half the vector's size (in bytes) on the device.

Then, with a regular buffer, the performance you see is expected.

  1. CL_MEM_ALLOC_HOST_PTR only allocates memory on the host, which does not incur any transfer at all: it is like doing a malloc and not filling the memory.
  2. CL_MEM_COPY_HOST_PTR will allocate a buffer on the device, most probably the RAM on GPUs, and then copy your whole host buffer over to the device memory.
  3. On GPUs, CL_MEM_USE_HOST_PTR most likely allocates so-called page-locked or pinned memory. This kind of memory is the fastest for host->GPU memory transfer and this is the recommended way to do the copy.

To read how to correctly use pinned memory on NVidia devices, refer to chapter 3.1.1 of NVidia's OpenCL best practices guide. Note that if you use too much pinned memory, performance may drop below that of host-copied memory.

The reason why pinned memory is faster than copied device memory is well explained in this SO question as well as the forum thread it points to.

Community
LucasB
  • I don't agree with your (1). `CL_MEM_ALLOC_HOST_PTR -- This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.` – Constantin Mar 28 '13 at 15:19
  • 3
    You get pinned memory with CL_MEM_ALLOC not USE – homemade-jam Apr 27 '13 at 12:35
1

Pompei2, you say that CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR allocate memory on the device, while the OpenCL 1.1 specification says that with CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR specified, memory will be allocated from (in the first case) or used from (in the second case) host memory. I'm a newbie in OpenCL, but I'd like to know which is true.

Ivan
  • Sorry, I have difficulties parsing your question. In any case, if the specification contradicts me, I am wrong and the specs are right. – LucasB Mar 29 '13 at 15:28