I'm trying to profile different implementations of a simple convolution. I've obtained results on several CPUs (i5, Xeon, etc.) and now I'm trying an Intel HD4000 via Intel Beignet.
I use clock_gettime on the host side, and CL_QUEUE_PROFILING_ENABLE with events on the device side. A stripped-down version of the code is:
clock_gettime(CLOCK_REALTIME, &start);
err = clEnqueueNDRangeKernel(queue, img_conv_kernel, 2, NULL,
&global_ws[0], &local_ws[0], 0, NULL, &event_clock);
if (err)
die("can not launch kernel %d\n", err);
/* profiling */
clWaitForEvents(1, &event_clock);
clGetEventProfilingInfo(event_clock, CL_PROFILING_COMMAND_START,
sizeof(cl_ulong), &cl_start, NULL);
clGetEventProfilingInfo(event_clock, CL_PROFILING_COMMAND_END,
sizeof(cl_ulong), &cl_stop, NULL);
clock_gettime(CLOCK_REALTIME, &end);
printf("%f %f ", (double)(cl_stop - cl_start) * 1e-6,
time_elapsed(start, end));
/* read data */
clock_gettime(CLOCK_REALTIME, &start);
err = clEnqueueReadBuffer(queue, res_d, CL_TRUE, 0, N*sizeof(float),
res_h, 0, NULL, NULL);
clock_gettime(CLOCK_REALTIME, &end);
printf("%f ", time_elapsed(start, end));
/* C implementation */
clock_gettime(CLOCK_REALTIME, &start);
conv(img_data, res_h, &sobel_gx[0][0], k, k);
clock_gettime(CLOCK_REALTIME, &end);
printf("%f\n", time_elapsed(start, end));
The results are:
231.592960 16.701613 3.995006 151.874017
/* device / host / reading data / basic C implementation */
What I don't understand is why the kernel execution time reported by the device is greater than the CPU time measured with clock_gettime, even though, following [0], I call clWaitForEvents() to make sure the kernel has fully executed.
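For reference, time_elapsed() is not shown in the stripped-down code. Below is a minimal sketch of what it is assumed to do: return the interval between two timespec values in milliseconds. That is the same unit the device figure is scaled to, since OpenCL profiling counters are in nanoseconds and are multiplied by 1e-6. The helper profiling_ms() is a hypothetical name added only for illustration, and uint64_t stands in for cl_ulong so the snippet compiles without the OpenCL headers.

```c
#include <stdint.h>
#include <time.h>

/* Assumed shape of the helper used in the question (the original is not
 * shown): elapsed time between two timespec values, in milliseconds. */
static double time_elapsed(struct timespec start, struct timespec end)
{
    double sec  = (double)(end.tv_sec - start.tv_sec);
    double nsec = (double)(end.tv_nsec - start.tv_nsec); /* may be negative */
    return sec * 1e3 + nsec * 1e-6;
}

/* Hypothetical helper: convert a pair of OpenCL profiling timestamps
 * (nanoseconds) to milliseconds. uint64_t stands in for cl_ulong. */
static double profiling_ms(uint64_t t_start, uint64_t t_end)
{
    return (double)(t_end - t_start) * 1e-6;
}
```

If both helpers really do report milliseconds, the first two numbers in the output are directly comparable, which is what makes the discrepancy surprising.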