I am trying to use multiple GPUs to work on a single problem domain. The main issue is that I need an efficient way to pass buffers between the GPUs. The buffers that need to be passed hold the boundary values of the sub-array assigned to each GPU; once these values have been exchanged at the end of a time step, the whole process can repeat for the next time step.
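To make the setup concrete, here is a minimal host-only sketch (plain C, no OpenCL) of the boundary exchange I have in mind. It assumes a 1D domain split into two halves with one ghost cell on each side. The names exchange_halo, smooth_step, run_single, run_split and the MAX_LOCAL cap are made up for illustration, and the 3-point average only stands in for my real kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LOCAL 64 /* hypothetical cap so the sketch needs no malloc */

/* Each subdomain is laid out as [ghost | n interior cells | ghost].
   Copy each side's boundary cell into the neighbour's ghost cell. */
void exchange_halo(double *left, double *right, size_t n)
{
    assert(n >= 1);
    left[n + 1] = right[1]; /* right's first interior cell -> left's right ghost */
    right[0]    = left[n];  /* left's last interior cell  -> right's left ghost  */
}

/* One 3-point averaging step on the n interior cells; ghosts are only read
   and carried over unchanged. */
void smooth_step(const double *in, double *out, size_t n)
{
    for (size_t i = 1; i <= n; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    out[0] = in[0];
    out[n + 1] = in[n + 1];
}

/* Reference: all 2*n cells in one array, outer boundaries fixed at 0. */
void run_single(const double *init, double *result, size_t n, int steps)
{
    double a[2 * MAX_LOCAL + 2] = {0}, b[2 * MAX_LOCAL + 2] = {0};
    assert(n >= 1 && n <= MAX_LOCAL);
    memcpy(a + 1, init, 2 * n * sizeof(double));
    for (int t = 0; t < steps; ++t) {
        smooth_step(a, b, 2 * n);
        memcpy(a, b, (2 * n + 2) * sizeof(double));
    }
    memcpy(result, a + 1, 2 * n * sizeof(double));
}

/* Same problem split in two halves ("device 0" and "device 1"), with a
   halo exchange before every time step. */
void run_split(const double *init, double *result, size_t n, int steps)
{
    double l0[MAX_LOCAL + 2] = {0}, l1[MAX_LOCAL + 2] = {0};
    double r0[MAX_LOCAL + 2] = {0}, r1[MAX_LOCAL + 2] = {0};
    assert(n >= 1 && n <= MAX_LOCAL);
    memcpy(l0 + 1, init, n * sizeof(double));
    memcpy(r0 + 1, init + n, n * sizeof(double));
    for (int t = 0; t < steps; ++t) {
        exchange_halo(l0, r0, n);  /* the step I want to do GPU to GPU */
        smooth_step(l0, l1, n);    /* would be the kernel on device 0 */
        smooth_step(r0, r1, n);    /* would be the kernel on device 1 */
        memcpy(l0, l1, (n + 2) * sizeof(double));
        memcpy(r0, r1, (n + 2) * sizeof(double));
    }
    memcpy(result, l0 + 1, n * sizeof(double));
    memcpy(result + n, r0 + 1, n * sizeof(double));
}
```

With 8 cells split 4 + 4, run_split should give the same values as run_single as long as the halo is exchanged before every step. The two assignments inside exchange_halo are the only data that has to cross between the GPUs.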
From an internet search, I found that clEnqueueMigrateMemObjects is meant for this purpose. However, I cannot find any examples of cross-GPU buffer transfers. The only explanation I have found is this post. The part I am having trouble understanding is the one I marked with arrows:
command queue on device 1:
- migrate memory buffer1
- enqueue kernels that process this buffer
- ==> save last event associated with buffer1 processing <==
command queue on device 2:
- migrate memory buffer1 - use the event produced by queue 1 to sync the migration.
- enqueue kernels that process this buffer
So, would the example code be something like the following? (Assuming I have two OpenCL devices on the same platform and in the same context.)
...
cl_context context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
cl_command_queue cmdq_dev0, cmdq_dev1;
cmdq_dev0 = clCreateCommandQueue(context, devices[0], 0, &status);
cmdq_dev1 = clCreateCommandQueue(context, devices[1], 0, &status);
cl_mem dev0_buf, dev1_buf, common_buf;
dev0_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, some_siz, NULL, &status);
dev1_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, some_siz, NULL, &status);
common_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, common_siz, NULL, &status);
/* blocking writes: copy the host arrays into the buffers */
status = clEnqueueWriteBuffer(cmdq_dev0, dev0_buf, CL_TRUE, 0, some_siz, dev0_arr, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdq_dev0, common_buf, CL_TRUE, 0, common_siz, common_arr, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdq_dev1, dev1_buf, CL_TRUE, 0, some_siz, dev1_arr, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdq_dev1, common_buf, CL_TRUE, 0, common_siz, common_arr, 0, NULL, NULL);
/* build some opencl program */
cl_kernel kernel0, kernel1;
kernel0 = clCreateKernel(program, "kernel0", &status);
kernel1 = clCreateKernel(program, "kernel1", &status);
/* buffer arguments are passed as cl_mem handles, not as host arrays */
status = clSetKernelArg(kernel0, 0, sizeof(cl_mem), &dev0_buf);
status = clSetKernelArg(kernel0, 1, sizeof(cl_mem), &common_buf);
status = clSetKernelArg(kernel1, 0, sizeof(cl_mem), &dev1_buf);
status = clSetKernelArg(kernel1, 1, sizeof(cl_mem), &common_buf);
/* part where kernels are executed */
status = clEnqueueNDRangeKernel(cmdq_dev0, kernel0, 3, NULL, something, NULL, 0, NULL, NULL);
status = clEnqueueMigrateMemObjects(cmdq_dev0, 1, &common_buf, CL_MIGRATE_MEM_OBJECT_HOST,0,NULL,NULL);
status = clEnqueueNDRangeKernel(cmdq_dev1, kernel1, 3, NULL, something, NULL, 0, NULL, NULL);
status = clEnqueueMigrateMemObjects(cmdq_dev1, 1, &common_buf, CL_MIGRATE_MEM_OBJECT_HOST,0,NULL,NULL);
...
In addition, I am confused about which command queue I should pass to clEnqueueMigrateMemObjects when it comes to moving the common_buf buffer object from device 0 to device 1, and vice versa.
Thanks.