
I am offloading work to the GPU using OpenCL (a variant of matrix multiplication). The matrix code itself works fantastically well, but the cost of moving data to the GPU is prohibitive.

I've moved from using clEnqueueReadBuffer/clEnqueueWriteBuffer to memory-mapped buffers, as follows:

d_a = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_ALLOC_HOST_PTR,
                     sizeof(char) * queryVector_size,
                     NULL, &err);
checkErr(err, "Buf A");

d_b = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_ALLOC_HOST_PTR,
                     sizeof(char) * segment_size,
                     NULL, &err);
checkErr(err, "Buf B");

err  = clSetKernelArg(ko_smat, 0, sizeof(cl_mem), &d_c);
checkErr(err,"Compute Kernel");
err = clSetKernelArg(ko_smat, 1, sizeof(cl_mem), &d_a);
checkErr(err,"Compute Kernel");
err = clSetKernelArg(ko_smat, 2, sizeof(cl_mem), &d_b);
checkErr(err,"Compute Kernel");

query_vector = (char*) clEnqueueMapBuffer(commands, d_a, CL_TRUE, CL_MAP_READ, 0,
                                          sizeof(char) * queryVector_size, 0, NULL, NULL, &err);
checkErr(err, "Write A");

segment_data = (char*) clEnqueueMapBuffer(commands, d_b, CL_TRUE, CL_MAP_READ, 0,
                                          sizeof(char) * segment_size, 0, NULL, NULL, &err);
checkErr(err, "Write B");

// code which initialises the buffers through the pointers (segment_data and query_vector)

err = clEnqueueUnmapMemObject(commands, d_a, query_vector, 0, NULL, NULL);
checkErr(err, "Unmap Buffer");

err = clEnqueueUnmapMemObject(commands, d_b, segment_data, 0, NULL, NULL);
checkErr(err, "Unmap Buff");
err = clEnqueueNDRangeKernel(commands, ko_smat, 2, NULL, globalWorkItems, localWorkItems, 0, NULL, NULL);

err = clFinish(commands);
checkErr(err, "Execute Kernel");

result = (char*) clEnqueueMapBuffer(commands, d_c, CL_TRUE, CL_MAP_WRITE, 0,
                                    sizeof(char) * result_size, 0, NULL, NULL, &err);
checkErr(err, "Write C");

printMatrix(result, result_row, result_col);

This code works fine when I use the clEnqueueReadBuffer/clEnqueueWriteBuffer methods to initialise d_a, d_b, and d_c, but when I use the mapped buffers, result is 0 because d_a and d_b are null when the kernel runs.

What is the appropriate way to map/unmap buffers?

EDIT: the core problem seems to be here:

segment_data = (char*) clEnqueueMapBuffer(commands, d_b, CL_TRUE, CL_MAP_READ, 0,
                                          sizeof(char) * segment_width * segment_length, 0, NULL, NULL, &err);

// INITIALISE

printMatrix(segment_data, segment_length, segment_width);

// ALL GOOD

err = clEnqueueUnmapMemObject(commands, d_b, segment_data, 0, NULL, NULL);
checkErr(err, "Unmap Buff");

segment_data = (char*) clEnqueueMapBuffer(commands, d_b, CL_TRUE, CL_MAP_READ, 0,
                                          sizeof(char) * segment_width * segment_length, 0, NULL, NULL, &err);

printMatrix(segment_data, segment_length, segment_width);

// ALL ZEROs again

The first printMatrix() returns the correct output; once I unmap and remap the buffer, segment_data becomes all 0s (its initial value). I suspect I'm using an incorrect flag somewhere, but I can't figure out where.

user1018513
3 Answers

query_vector = (char*) clEnqueueMapBuffer(commands, d_a, CL_TRUE, CL_MAP_READ, 0,
                                          sizeof(char) * queryVector_size, 0, NULL, NULL, &err);
checkErr(err, "Write A");

segment_data = (char*) clEnqueueMapBuffer(commands, d_b, CL_TRUE, CL_MAP_READ, 0,
                                          sizeof(char) * segment_size, 0, NULL, NULL, &err);
checkErr(err, "Write B");

The buffers are mapped with CL_MAP_READ, but the code then writes to them. Unlike the flags used at buffer creation, the map flags describe the host's view of the memory, not the device's. Since the host is writing the input data, the buffers should be mapped with CL_MAP_WRITE; otherwise any changes may simply be discarded when the buffer is unmapped.
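
For example, the input maps above would look something like this with the corrected flag (a minimal sketch reusing the question's variable names; the error labels are illustrative):

query_vector = (char*) clEnqueueMapBuffer(commands, d_a, CL_TRUE, CL_MAP_WRITE, 0,
                                          sizeof(char) * queryVector_size, 0, NULL, NULL, &err);
checkErr(err, "Map A");

segment_data = (char*) clEnqueueMapBuffer(commands, d_b, CL_TRUE, CL_MAP_WRITE, 0,
                                          sizeof(char) * segment_size, 0, NULL, NULL, &err);
checkErr(err, "Map B");

// ... fill query_vector and segment_data on the host ...

// Unmap before enqueueing the kernel so the host writes become visible to the device.
err = clEnqueueUnmapMemObject(commands, d_a, query_vector, 0, NULL, NULL);
checkErr(err, "Unmap A");
err = clEnqueueUnmapMemObject(commands, d_b, segment_data, 0, NULL, NULL);
checkErr(err, "Unmap B");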

user1018513

From the OpenCL 1.2 spec:

5.4.3 Accessing mapped regions of a memory object

...

If a memory object is currently mapped for reading, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.

So, you need to map the results buffer after you've enqueued the kernel. Similarly, you need to unmap the input buffers before you enqueue the kernel. The timeline for mapping/unmapping buffers should be roughly as follows (see the sketch after this list):

Create input buffers
Create output buffers
Map input buffers
Write input data
Unmap input buffers
Enqueue kernel
Map output buffers
Read output data
Unmap output buffers
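
Applied to the code in the question, the end of that timeline would look roughly like this (a sketch reusing the question's names; the blocking map on an in-order queue waits for the kernel to finish before returning):

err = clEnqueueNDRangeKernel(commands, ko_smat, 2, NULL, globalWorkItems, localWorkItems, 0, NULL, NULL);
checkErr(err, "Enqueue Kernel");

// Map the output with CL_MAP_READ: the host only reads the results.
result = (char*) clEnqueueMapBuffer(commands, d_c, CL_TRUE, CL_MAP_READ, 0,
                                    sizeof(char) * result_size, 0, NULL, NULL, &err);
checkErr(err, "Map C");

printMatrix(result, result_row, result_col);

err = clEnqueueUnmapMemObject(commands, d_c, result, 0, NULL, NULL);
checkErr(err, "Unmap C");
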
jprice
  • Agreed. Also, if you have multiple kernels to run, another win is to overlap data transfer with kernel computation. Many high-end GPUs have dual DMA engines and can be doing an upload, a download, and compute all at the same time. By overlapping operations like this you only pay for the most expensive one (see the sketch after these comments). – Dithermaster Oct 26 '14 at 15:28
  • When I do this (which was what i was doing initially), I systematically get the output as null – user1018513 Oct 26 '14 at 17:07
  • @user1018513 Do you mean `clEnqueueMapBuffer` returns `NULL`? If so, what is the error code it returns? – jprice Oct 26 '14 at 17:20
  • I've edited the code above. Sorry for poor phrasing. My kernel views d_a and d_b as being 0 everywhere (I print out the content of query_vector and segment_data just before unmapping, and they are correctly initialised), which in turn causes result to be 0. There are no errors thrown anywhere. If I manually force result to an arbitrary value in my kernel, I can read back the correct value. Similarly, if I unmap/remap d_a, then the second time d_a is all 0s. – user1018513 Oct 26 '14 at 17:26
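
To illustrate the overlap idea from the first comment: with two command queues and an event, the transfer for the next chunk can proceed while the current chunk computes. This is a hypothetical sketch; device, kernel, d_in, host_in, chunk_bytes, and gsize are made-up names, and real code would also check errors, release the event, and download results.

// One in-order queue for transfers, one for compute, on the same device.
cl_command_queue xfer_queue = clCreateCommandQueue(context, device, 0, &err);
cl_command_queue comp_queue = clCreateCommandQueue(context, device, 0, &err);

cl_event upload_done;

// Chunk 0: non-blocking upload, then a kernel that waits on the upload event.
clEnqueueWriteBuffer(xfer_queue, d_in[0], CL_FALSE, 0, chunk_bytes, host_in[0], 0, NULL, &upload_done);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in[0]);
clEnqueueNDRangeKernel(comp_queue, kernel, 1, NULL, &gsize, NULL, 1, &upload_done, NULL);

// While chunk 0 computes, the transfer queue can already upload chunk 1.
clEnqueueWriteBuffer(xfer_queue, d_in[1], CL_FALSE, 0, chunk_bytes, host_in[1], 0, NULL, NULL);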

Using mapped buffers is clearly the best way to speed up your code. Create the buffers with CL_MEM_ALLOC_HOST_PTR; this takes some of the transfer burden off the CPU by letting the runtime initiate DMA transfers.

Here is an example of using the mapped buffers:

// pointer to hold the result
int *host_ptr = malloc(size * sizeof(int));

d_mem = clCreateBuffer(context, CL_MEM_READ_WRITE|CL_MEM_ALLOC_HOST_PTR,
                       size * sizeof(cl_int), NULL, &ret);

// map the buffer for writing and fill it from the host
int *map_ptr = (int *) clEnqueueMapBuffer(command_queue, d_mem, CL_TRUE, CL_MAP_WRITE,
                                          0, size * sizeof(int), 0, NULL, NULL, &ret);
// initialize data
for (int i = 0; i < size; i++) {
  map_ptr[i] = i;
}

// unmap before the kernel runs so the data is visible to the device
ret = clEnqueueUnmapMemObject(command_queue, d_mem, map_ptr, 0, NULL, NULL);

// set OpenCL kernel parameters
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_mem);

size_t global_work[1] = { size };
// execute OpenCL kernel
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             global_work, NULL, 0, NULL, NULL);

// map the buffer for reading to get the results back
map_ptr = (int *) clEnqueueMapBuffer(command_queue, d_mem, CL_TRUE, CL_MAP_READ,
                                     0, size * sizeof(int), 0, NULL, NULL, &ret);
// copy the data to the result array
for (int i = 0; i < size; i++) {
  host_ptr[i] = map_ptr[i];
}

ret = clEnqueueUnmapMemObject(command_queue, d_mem, map_ptr, 0, NULL, NULL);

// clFinish etc.

It is taken from this post.

VAndrei