
I have a GPU with CC 3.0, so it should support 16 concurrent kernels. I am starting 10 kernels by looping over clEnqueueNDRangeKernel 10 times. How can I verify that the kernels are executing concurrently?

One way I have thought of is to get the time before and after the NDRangeKernel statement. I might have to use events to ensure that the kernel execution has completed. But I still feel that the loop will start the kernels sequentially. Can someone help me out?

Nike
  • If you are placing all the kernels in the same command queue, they would indeed be executed sequentially (that's why it's called a queue). Measuring the time for each individual kernel and the total execution time is the only practical way to determine whether kernels are executed in parallel, I'm afraid. – aland Aug 01 '12 at 16:57
  • I have placed all the kernels in different command queues. – Nike Aug 01 '12 at 17:16
  • Your assumption is that CUDA Compute Capability 3.0, which supports 16 CUDA streams on the Fermi architecture, is available as an OpenCL feature? Is there anything in the NVidia docs to support this assumption? If you are expecting to use OpenCL Device Fission, this extension can be queried as a supported extension using clGetDeviceInfo() – Tim Child Aug 05 '12 at 14:27
  • I read about the support for concurrent kernels in the [OpenCL Programming Guide](http://developer.download.nvidia.com/compute/DevZone/docs/html/OpenCL/doc/OpenCL_Programming_Guide.pdf) Section 3.2.2 – Nike Aug 05 '12 at 18:51

3 Answers


To determine if your kernel executions overlap, you have to profile them. This requires several steps:

1. Creating the command-queues

Profiling data is only collected if the command-queue is created with the property CL_QUEUE_PROFILING_ENABLE:

cl_int errcode;
cl_command_queue queues[10];
for (int i = 0; i < 10; ++i) {
  queues[i] = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE,
                                   &errcode);
}

2. Making sure all kernels start at the same time

You are right in your assumption that the CPU queues the kernels sequentially. However, you can create a single user event and add it to the wait list for all kernels. This causes the kernels not to start running before the user event is completed:

// Create the user event
cl_event user_event = clCreateUserEvent(context, &errcode);

// Reserve space for kernel events
cl_event kernel_events[10];

// Enqueue kernels
for (int i = 0; i < 10; ++i) {
  clEnqueueNDRangeKernel(queues[i], kernel, work_dim, global_work_offset,
                         global_work_size, 1, &user_event, &kernel_events[i]);
}

// Start all kernels by completing the user event
clSetUserEventStatus(user_event, CL_COMPLETE);

3. Obtaining the profiling times

Finally, we can collect the timing information for the kernel events:

// Block until all kernels have run to completion
clWaitForEvents(10, kernel_events);

for (int i = 0; i < 10; ++i) {
  cl_ulong start;
  clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_START,
                          sizeof(start), &start, NULL);
  cl_ulong end;
  clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_END,
                          sizeof(end), &end, NULL);
  printf("Event %d: start=%llu, end=%llu\n", i, start, end);
}

4. Analyzing the output

Now that you have the start and end times of all kernel runs, you can check for overlaps (either by hand or programmatically). The output units are nanoseconds. Note however that the device timer is only accurate to a certain resolution. You can query the resolution using:

size_t resolution;
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                sizeof(resolution), &resolution, NULL);

FWIW, I tried this on an NVIDIA device with CC 2.0 (which should support concurrent kernels) and observed that the kernels were run sequentially.

reima
  • Thanks a lot. I will try this out and let you know about the results. Just yesterday as I was googling out, I found that there is something called device fission which enables us to create subdevices on a single device and we can execute the kernels on different subdevices. AMD devices support device fission for sure but still not sure if NVIDIA devices support it. Am still reading about it.. – Nike Aug 03 '12 at 16:08
  • Hi, I recently used these steps to test if my kernels are running concurrently. I am using an AMD A10 APU which i believe supports concurrent kernel execution. But the profiling results tells me that the kernels are executing one after the other and not concurrently. Is there some specific option that i need to enable to get concurrent kernel execution working? I have posted my code and output at the following link : http://stackoverflow.com/questions/35341061/concurrent-kernel-execution-not-working-in-amd-a10-apu – Johns Paul Feb 12 '16 at 06:20
  • This method won't actually work. First, queues are created in-order by default (as in this example), but even if you pass the out of order flag, the runtime may decide to run your kernels sequentially. Second, OpenCL command queues may or may not be mapped to hardware queues depending on your runtime, so even using separate queues may give you sequential execution. – Mokosha Apr 26 '16 at 22:12
  • @Mokosha I think you misunderstood the purpose of the example. It shows a way to determine **if** the runtime executes some kernels concurrently when putting them on multiple queues, which I believe is what the OP wanted to know. It never intended to show **how** to actually make/force the runtime to do that. Regarding the `CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE` flag you mentioned: as per the spec, this only affects reordering of commands within the same queue, not the interaction between multiple queues. – reima Apr 27 '16 at 10:09

You can avoid all the boilerplate code suggested in the other answers (which are correct, by the way) by using the C Framework for OpenCL, which simplifies this task a lot and gives you detailed information about OpenCL events (kernel execution, data transfers, etc.), including a table and a plot dedicated to overlapped execution of said events.

I developed this library in order to, among other things, simplify the process described in the other answers. You can see a basic usage example here.

faken

Yes, as you suggest, try to use the events, and analyze all the QUEUED, SUBMIT, START, END values. These should be absolute values in "device time", and you may be able to see if processing (START to END) overlaps for the different kernels.

Eric Bainville