How to use clEnqueueMigrateMemObjects in order to pass buffers in between GPUs under a same context

Question

I am trying to use multiple GPUs to work on the problem domain. Main issue is that I somehow have to find a way to effectively pass buffers between the GPUs. The buffer that needs to be passed is boundary values of the array that each GPUs are assigned to, so that once these values are updated every time step the whole process can repeat for the next time step.

From internet search, I've found out that clEnqueueMigrateMemObects is for this purpose. But, I can not find any examples regarding cross GPU buffer transfers. Only one explanation that I have found is this post. The part that I am having trouble understanding is this part (where I put the arrow)

command queue on device 1:

migrate memory buffer1

enqueue kernels that process this buffer

==> save last event associated with buffer1 processing <==

command queue on device 2:

migrate memory buffer1 - use the event produced by queue 1 to sync the migration.

enqueue kernels that process this buffer

So, the example code would be something like below? (given that I have two OpenCL devices using the same platform and the same context...)

...
cl_context context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
cl_command_queue cmdq_dev0, cmdq_dev1;

cmdq_dev0 = clCreateCommandQueue(context, devices[0], 0, &status);
cmdq_dev1 = clCreateCommandQueue(context, devices[1], 0, &status);

cl_mem dev0_buf, dev1_buf, common_buf;

dev0_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, some_siz, NULL, &status);
dev1_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, some_siz, NULL, &status);
common_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, some_siz, NULL, &status);

status   =  clEnqueueWriteBuffer(cmdq_dev0, buf_arr  , CL_TRUE, 0, some_siz, dev0_arr, 0, NULL, NULL);
status   =  clEnqueueWriteBuffer(cmdq_dev0, common_buf, CL_TRUE, 0, common_siz, common_arr, 0, NULL, NULL);

status   =  clEnqueueWriteBuffer(cmdq_dev1, buf_arr  , CL_TRUE, 0, some_siz, dev1_arr, 0, NULL, NULL);
status   =  clEnqueueWriteBuffer(cmdq_dev1, common_buf, CL_TRUE, 0, common_siz, common_arr, 0, NULL, NULL);

/* build some opencl program */

cl_kernel kernel0, kernel1

kernel0 = clCreateKernel(program, "kernel0", &status);
kernel1 = clCreateKernel(program, "kernel1", &status);

status = clSetKernelArg(kernel0, 0, sizeof(cl_int), &dev0_arr  );
status = clSetKernelArg(kernel0, 1, sizeof(cl_int), &common_arr  );

status = clSetKernelArg(kernel1, 0, sizeof(cl_int), &dev1_arr  );
status = clSetKernelArg(kernel1, 1, sizeof(cl_int), &common_arr  );

/* part where kernels are executed */

status = clEnqueueNDRangeKernel(cmdq_dev0, kernel0, 3, NULL, something, NULL, 0, NULL, NULL);

status = clEnqueueMigrateMemObjects(cmdq_dev0, 1, &common_buf, CL_MIGRATE_MEM_OBJECT_HOST,0,NULL,NULL);

status = clEnqueueNDRangeKernel(cmdq_dev1, kernel0, 3, NULL, something, NULL, 0, NULL, NULL);

status = clEnqueueMigrateMemObjects(cmdq_dev1, 1, &common_buf, CL_MIGRATE_MEM_OBJECT_HOST,0,NULL,NULL);

...

In addition, I am confused about the command queue that I should be specifying in the function clEnqueueMigrateMemObjects when it comes to passing the common_buf buffer object from device 0 to device 1, and vice versa.

Thanks.

score 0 · Accepted Answer · answered Jun 15 '22 at 21:19

0

The thing that is missing in your in your code is inter-queue synchronization using event objects (search for - OpenCL events, OpenCL synchronization).

When working with multiple queues (whether on the same device or not), you must synchronize their execution when they access shared data. Commands that are enqueued to separate queues can run in parallel to each other. The order in which you enqueue to separate queues does not matter. They do not execute in order with respect to each other, only with respect to themselves.

A typical scenario, as in you case, is a producer-consumer type of workflow. where the producer queue produces data that is later read by the consumer queue. The producer queue must supply the consumer queue with an event object to wait on prior to reading the produced data.

The part that I am having trouble understanding is this part (where I put the arrow)

This part means that you should obtain the cl_event object from the last enqueue call (the last parameter to clEnqueueNDRangeKernel) and give it in the wait list parameter of clEnqueueMigrateMemObjects in the other queue.

In addition, I am confused about the command queue that I should be specifying in the function clEnqueueMigrateMemObjects when it comes to passing the common_buf buffer object from device 0 to device 1, and vice versa.

The queue that should call this function is the consumer queue that intends to enqueue a kernel to access the buffer. But you should do this only after the other producer queue has finished writing to the buffer (by using the cl_event object described earlier as the wait list for clEnqueueMigrateMemObjects).

answered Jun 15 '22 at 21:19

Elad Maimoni

3,703
3
20
37

Thank you for your tip! I've never realized I would need an `cl_event` variable to be passed in order for `clEnqueueMigrateMemObects()`to work properly. I have another question. If you have multiple memory objects you wish to migrate, how should I specify that in `clEnqueueMigrateMemObects()`? Reference manual says 'mem_objects' to be a 'list' of memory objects. Or should I just call `clEnqueueMigrateMemObects()` multiple times for multiple memory objeccts? Thanks. – Redshoe Jun 16 '22 at 03:44
if the list of memory objects are synchronized by the same event wait list, you could use a single call – Elad Maimoni Jun 16 '22 at 08:30
Hmm.. I don't quite get it. So, if, for example, I execute `clEnqueueNDRangeKernel()` with 4 variables (let's say a,b,c,d), I could use `clEnqueueMigrateMemObjects()` to pass the entire 4 variables with a single call? Then my next question would be what should be the 'mem_objects' should be? Could it be something like `clEnqueueMigrateMemObjects(cmdq_dev0,4,&a,&b,&c,&d,...)`? – Redshoe Jun 16 '22 at 13:10
you should supply an array of cl_mem variables. note that clEnqueueMigrateMemObjects accepts a pointer (potentially an array or a single variable) – Elad Maimoni Jun 16 '22 at 13:27
I understand that. But my confusion is that what if there are multiple memory object (number of arrays) that I would like to pass? – Redshoe Jun 16 '22 at 15:29
multiple memory objects means an array of cl_mem objects. `cl_mem params[] {a, b, c, d}`. – Elad Maimoni Jun 16 '22 at 15:35
Thanks. Is it possible to pass array of pointers? I have 4 of 1D arrays that I would like to migrate from a GPU to another GPU. – Redshoe Jun 16 '22 at 16:33

How to use clEnqueueMigrateMemObjects in order to pass buffers in between GPUs under a same context

1 Answers1