I am a little confused as to whether my code using OpenCL mapped buffers is correct.

I have two examples, one using CL_MEM_USE_HOST_PTR and one using CL_MEM_ALLOC_HOST_PTR. Both work and run on my local machine and its OpenCL devices, but I am interested in whether this is the correct way of doing the mapping, and whether it should work on all OpenCL devices. I am especially unsure about the USE_HOST_PTR example.

I am only interested in the buffer/map-specific operations. I am aware that I should do error checking and so forth.

CL_MEM_ALLOC_HOST_PTR:

// pointer to hold the result
int * host_ptr = malloc(size * sizeof(int));
int i;

d_mem = clCreateBuffer(context,CL_MEM_READ_WRITE|CL_MEM_ALLOC_HOST_PTR,
                       size*sizeof(cl_int), NULL, &ret);

int * map_ptr = clEnqueueMapBuffer(command_queue,d_mem,CL_TRUE,CL_MAP_WRITE,
                                   0,size*sizeof(int),0,NULL,NULL,&ret);
// initialize data
for (i=0; i<size;i++) {
  map_ptr[i] = i;
}

ret = clEnqueueUnmapMemObject(command_queue,d_mem,map_ptr,0,NULL,NULL); 

//Set OpenCL Kernel Parameters
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_mem);

size_t global_work[1]  = { size };
//Execute OpenCL Kernel
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, 
                             global_work, NULL, 0, 0, NULL);

map_ptr = clEnqueueMapBuffer(command_queue,d_mem,CL_TRUE,CL_MAP_READ,
                             0,size*sizeof(int),0,NULL,NULL,&ret);
// copy the data to result array 
for (i=0; i<size;i++){
  host_ptr[i] = map_ptr[i];
} 

ret = clEnqueueUnmapMemObject(command_queue,d_mem,map_ptr,0,NULL,NULL);        

// cl finish etc     

CL_MEM_USE_HOST_PTR:

// pointer to hold the result
int * host_ptr = malloc(size * sizeof(int));
int i;
for(i=0; i<size;i++) {
  host_ptr[i] = i;
}

d_mem = clCreateBuffer(context,CL_MEM_READ_WRITE|CL_MEM_USE_HOST_PTR,
                       size*sizeof(cl_int), host_ptr, &ret);

// No need to map or unmap here: since the buffer was created with
// CL_MEM_USE_HOST_PTR, the data in host_ptr is already in the buffer?

//Set OpenCL Kernel Parameters
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_mem);

size_t global_work[1]  = { size };
//Execute OpenCL Kernel
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, 
                             global_work, NULL, 0, 0, NULL);

// clEnqueueMapBuffer should return host_ptr here (I assume it always will?),
// but we still store the returned pointer so it can be passed to
// clEnqueueUnmapMemObject. The map call is needed to ensure the data is
// copied back to the host; there's no need to manually copy it into
// host_ptr as the buffer uses that memory directly.
int * map_ptr = clEnqueueMapBuffer(command_queue,d_mem,CL_TRUE,CL_MAP_READ,
                                   0,size*sizeof(int),0,NULL,NULL,&ret);

ret = clEnqueueUnmapMemObject(command_queue,d_mem,map_ptr,0,NULL,NULL);

// cl finish, cleanup etc

1 Answer

If you use CL_MEM_ALLOC_HOST_PTR, there is a chance that the underlying OpenCL implementation will back the buffer with page-locked (pinned) memory.

That means the pages cannot be swapped out to disk and that transfers between host and device memory can be done DMA-style without wasting CPU cycles. In that case CL_MEM_ALLOC_HOST_PTR would be the best solution.
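
One rough way to gauge whether a mapped CL_MEM_ALLOC_HOST_PTR buffer is likely to be zero-copy is to ask the device whether it shares physical memory with the host. A minimal sketch, assuming device is a valid cl_device_id from your existing setup (the query is only a hint, not a guarantee of zero-copy behaviour):

#include <CL/cl.h>
#include <stdio.h>

/* Prints whether the device shares physical memory with the host.
 * On such devices a mapped CL_MEM_ALLOC_HOST_PTR buffer is more likely
 * to be zero-copy; on discrete GPUs the runtime may still use pinned
 * staging memory for DMA transfers. */
void print_unified_memory_hint(cl_device_id device)
{
    cl_bool unified = CL_FALSE;
    /* CL_DEVICE_HOST_UNIFIED_MEMORY exists in OpenCL 1.1/1.2
     * (deprecated in 2.0, where SVM capability queries replace it). */
    clGetDeviceInfo(device, CL_DEVICE_HOST_UNIFIED_MEMORY,
                    sizeof(unified), &unified, NULL);
    printf("Host-unified memory: %s\n", unified ? "yes" : "no");
}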

nVidia has the page-locked (pinned) memory feature and should also use it in their OpenCL implementation. For AMD it is not certain whether they do the same. Check here for more details.

Using CL_MEM_USE_HOST_PTR mainly makes the programmer's life easier, so in the unlikely case that the hardware cannot use page-locked memory you could fall back to this option.
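
If you do take the CL_MEM_USE_HOST_PTR route, one option is to allocate the host memory page-aligned so the runtime has a chance to pin it. The sketch below uses a 4096-byte alignment as an illustrative value only; the exact requirements are implementation-specific, so check your vendor's documentation:

#define _POSIX_C_SOURCE 200112L
#include <CL/cl.h>
#include <stdlib.h>

/* Allocates a page-aligned host array, fills it, and wraps it in a
 * CL_MEM_USE_HOST_PTR buffer. Error checking is kept minimal, as in
 * the question's snippets. */
static cl_mem create_use_host_ptr_buffer(cl_context context, size_t size,
                                         int **host_ptr_out, cl_int *ret)
{
    int *host_ptr = NULL;
    /* page-aligned allocation instead of a plain malloc() */
    if (posix_memalign((void **)&host_ptr, 4096, size * sizeof(int)) != 0)
        return NULL;

    for (size_t i = 0; i < size; i++)
        host_ptr[i] = (int)i;

    *host_ptr_out = host_ptr;
    return clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          size * sizeof(cl_int), host_ptr, ret);
}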

  • Hi, thanks for your answer. Could you include a better way of implementing the first example that would not raise performance issues? I am not quite sure I understand why I cannot be sure that the cached copy in device memory is used. – Jacob Oct 09 '14 at 11:47
  • Hmm, that's interesting. I think nVidia recommends doing allocation, map, write, kernel execution, read as per http://www.nvidia.com/content/cudazone/cudabrowser/downloads/papers/nvidia_opencl_bestpracticesguide.pdf and they also recommend using ALLOC_HOST_PTR instead of USE_HOST_PTR. I think Intel also recommends using ALLOC_HOST_PTR, as you otherwise need to ensure correct page-boundary alignment: https://software.intel.com/sites/landingpage/opencl/optimization-guide/index.htm . ARM too: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0538e/BABJHCCH.html – Jacob Oct 09 '14 at 12:53
  • ALLOC_HOST_PTR may outperform USE_HOST_PTR since the runtime does its best to get the fastest memory available (for example pinned pages). – user703016 Oct 09 '14 at 13:02
  • Thanks @Cicada. In the case of USE_HOST_PTR, would DMA transfers be possible? I guess that depends on how you allocated your memory, but it seems unlikely. So the advantage of using USE_HOST_PTR over normal buffers is that a CPU implementation would (probably) not need a data copy? – Jacob Oct 09 '14 at 13:18
  • Don't quote me on this, but I believe DMA transfers on USE_HOST_PTR would work **if** you properly allocated the memory as pinned and if the implementation supports it. The advantage of USE_HOST_PTR is simply that the programmer controls the allocation more finely than with ALLOC_HOST_PTR. – user703016 Oct 09 '14 at 13:22