
Hi, I am new to OpenCL and am using the C++ wrapper. I am trying to run the same kernel on two devices simultaneously. I create one buffer, split it into chunks using sub-buffers, and pass those chunks to the kernel, dispatching it twice: once to command queue 1 and once to command queue 2, each with a different chunk of the main buffer.

When I run it, it throws error -13. All the other sub-buffers are created successfully; only the one marked in the code below fails.

Any guidance will be much appreciated.

I am using OpenCL 1.1.

    //Creating main buffers
    cl::Buffer zeropad_buf(openclObjects.context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, (size+2)*(size+2)*cshape[level][1]*sizeof(float), zeropad);
    cl::Buffer output_buf(openclObjects.context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, cshape[level][0]*size*size*sizeof(float), output_f);

    //Creating sub-buffers for zeropad_buf and output_buf
    size_t zeropad_buf_size = (size+2)*(size+2)*cshape[level][1]*sizeof(float);
    size_t output_buf_size = cshape[level][0]*size*size*sizeof(float);

    cl_buffer_region zero_rgn_4core = {0, zeropad_buf_size/2};
    cl_buffer_region zero_rgn_2core = {zeropad_buf_size/2, zeropad_buf_size/2}; //Throws error -13

    cl_buffer_region output_rgn_4core = {0, output_buf_size/2};
    cl_buffer_region output_rgn_2core = {output_buf_size/2, output_buf_size/2};

    cl::Buffer zeropad_buf_4Core = zeropad_buf.createSubBuffer(CL_MEM_READ_ONLY, CL_BUFFER_CREATE_TYPE_REGION, &zero_rgn_4core);
    cl::Buffer zeropad_buf_2Core = zeropad_buf.createSubBuffer(CL_MEM_READ_ONLY, CL_BUFFER_CREATE_TYPE_REGION, &zero_rgn_2core); //This is the call that fails
    std::cout<<"zero_pad sub-buffer created"<<std::endl;

    cl::Buffer output_buf_4Core = output_buf.createSubBuffer(CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, &output_rgn_4core);
    cl::Buffer output_buf_2Core = output_buf.createSubBuffer(CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, &output_rgn_2core);

1 Answer


From the documentation:

CL_MISALIGNED_SUB_BUFFER_OFFSET is returned in errcode_ret if there are no devices in context associated with buffer for which the origin value is aligned to the CL_DEVICE_MEM_BASE_ADDR_ALIGN value.

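Error -13 is CL_MISALIGNED_SUB_BUFFER_OFFSET. It looks like you might need to align your split region offsets and sizes to lie on integer multiples of the least common multiple (LCM) of the CL_DEVICE_MEM_BASE_ADDR_ALIGN properties of all of your devices. Note that this property is reported in bits, while the origin in cl_buffer_region is a byte offset, so the strict requirement in bytes is the reported value divided by 8.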

By this, I mean something like the following:

Assuming the devices you are using are in a variable

std::vector<cl::Device> devices;

Query the CL_DEVICE_MEM_BASE_ADDR_ALIGN property for each device:

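// CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in bits; sub-buffer origins are
// byte offsets, so convert to bytes before combining. Needs <numeric> for std::lcm.
cl_uint total_alignment_requirement = 1; // in bytes
for (cl::Device& dev : devices)
{
    cl_uint device_mem_base_align_bits = 0;
    if (CL_SUCCESS == dev.getInfo(CL_DEVICE_MEM_BASE_ADDR_ALIGN, &device_mem_base_align_bits)
        && device_mem_base_align_bits >= 8) // skip values that would give a zero byte alignment
        total_alignment_requirement = std::lcm(total_alignment_requirement, device_mem_base_align_bits / 8);
}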

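Then, when it comes to allocating host memory that OpenCL uses in place, make sure it is aligned to total_alignment_requirement. In your code that is output_f, which backs output_buf via CL_MEM_USE_HOST_PTR; zeropad is only copied (CL_MEM_COPY_HOST_PTR), so its host alignment should not matter. For example, if you're currently allocating with malloc(), use posix_memalign() instead. (Even better, don't create the buffer using CL_MEM_USE_HOST_PTR and let OpenCL allocate the memory if you can.)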
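As a rough illustration of the posix_memalign() suggestion, here is a minimal sketch. The helper name alloc_aligned_floats is made up for this example, it assumes a POSIX system, and it assumes the alignment is given in bytes and is a power of two (which posix_memalign() requires).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h> // posix_memalign and free on POSIX systems

// Hypothetical helper: allocates `count` floats whose start address is a
// multiple of `alignment_bytes`. Returns nullptr on failure.
// The caller releases the memory with free().
inline float* alloc_aligned_floats(std::size_t count, std::size_t alignment_bytes)
{
    // posix_memalign requires a power of two that is a multiple of sizeof(void*).
    assert(alignment_bytes >= sizeof(void*));
    assert((alignment_bytes & (alignment_bytes - 1)) == 0);

    void* ptr = nullptr;
    if (posix_memalign(&ptr, alignment_bytes, count * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float*>(ptr);
}
```

The returned pointer could then be passed as the host pointer of a CL_MEM_USE_HOST_PTR buffer in place of memory obtained from malloc().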

Finally, your regions need to be aligned too:

size_t zeropad_split_pos = zeropad_buf_size / 2;
zeropad_split_pos -= zeropad_split_pos % total_alignment_requirement;
cl_buffer_region zero_rgn_4core = {0, zeropad_split_pos};
cl_buffer_region zero_rgn_2core = {zeropad_split_pos, zeropad_buf_size - zeropad_split_pos};

This ensures that the first region starts and ends on an address that is a multiple of total_alignment_requirement, and the second region starts on an aligned address too.
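The same rounding can be wrapped in a small helper so it can be reused for both buffers. This is only a sketch: Region is a stand-in for cl_buffer_region so the example compiles without the OpenCL headers, and split_aligned is a name invented here, not part of any OpenCL API.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for cl_buffer_region (same two fields, same order).
struct Region {
    std::size_t origin; // byte offset into the parent buffer
    std::size_t size;   // length in bytes
};

struct Split {
    Region first;
    Region second;
};

// Hypothetical helper: splits [0, total_bytes) near the midpoint, rounding
// the split position down to a multiple of alignment_bytes.
inline Split split_aligned(std::size_t total_bytes, std::size_t alignment_bytes)
{
    assert(alignment_bytes != 0);
    std::size_t split_pos = total_bytes / 2;
    split_pos -= split_pos % alignment_bytes;
    // If the buffer is smaller than twice the alignment, the first region
    // would be empty, which is not a usable sub-buffer.
    assert(split_pos != 0 && "buffer too small to split at this alignment");
    return Split{ Region{0, split_pos}, Region{split_pos, total_bytes - split_pos} };
}
```

For a 1000-byte buffer and a 128-byte alignment this gives regions {0, 384} and {384, 616}. Note that the split point is chosen purely by byte alignment, so it may fall in the middle of a row or channel of your data; the kernel's indexing would have to account for that.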

(I haven't tested this code, but it should be close to correct. Note that std::lcm is declared in <numeric> and requires C++17, so if that's not available in your toolchain, you'll need to supply your own lcm function.)
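If std::lcm is not available, a replacement is only a few lines. The following is one possible sketch using Euclid's algorithm; the names gcd_fallback and lcm_fallback are made up for this example, and the functions are written as single-expression constexpr so they also compile as C++11.

```cpp
#include <cstddef>

// Greatest common divisor via Euclid's algorithm (recursive so that it is
// valid as a C++11 constexpr function).
constexpr std::size_t gcd_fallback(std::size_t a, std::size_t b)
{
    return b == 0 ? a : gcd_fallback(b, a % b);
}

// Least common multiple. Returns 0 if either argument is 0, like std::lcm.
// Dividing before multiplying keeps the intermediate value small.
constexpr std::size_t lcm_fallback(std::size_t a, std::size_t b)
{
    return (a == 0 || b == 0) ? 0 : (a / gcd_fallback(a, b)) * b;
}

// Compile-time sanity checks.
static_assert(lcm_fallback(128, 512) == 512, "one value divides the other");
static_assert(lcm_fallback(6, 4) == 12, "general case");
```

Since device alignments are normally powers of two, the LCM usually ends up being simply the largest of the values.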

pmdj