I'm trying to profile different implementations of a simple convolution. I've obtained several results on different CPUs (i5, Xeon, etc.) and now I'm trying on an Intel HD4000, via Intel Beignet.

I use clock_gettime on the host side and CL_QUEUE_PROFILING_ENABLE plus events on the device side. A stripped-down version of the code would be:

clock_gettime(CLOCK_REALTIME, &start);

err = clEnqueueNDRangeKernel(queue, img_conv_kernel, 2, NULL,
        &global_ws[0], &local_ws[0], 0, NULL, &event_clock);

if (err)
    die("can not launch kernel %d\n", err);

/* profiling */
clWaitForEvents(1, &event_clock);
clGetEventProfilingInfo(event_clock, CL_PROFILING_COMMAND_START, 
        sizeof(cl_ulong), &cl_start, NULL);
clGetEventProfilingInfo(event_clock, CL_PROFILING_COMMAND_END, 
        sizeof(cl_ulong), &cl_stop, NULL);

clock_gettime(CLOCK_REALTIME, &end);
printf("%f %f ", (double) (cl_stop - cl_start) * 1e-6, 
            time_elapsed(start, end));

/* read data */
clock_gettime(CLOCK_REALTIME, &start);
err = clEnqueueReadBuffer(queue, res_d, CL_TRUE, 0, N*sizeof(float),
        res_h, 0, NULL, NULL);
clock_gettime(CLOCK_REALTIME, &end);

printf("%f ", time_elapsed(start, end));

/* C implementation */
clock_gettime(CLOCK_REALTIME, &start);
conv(img_data, res_h, &sobel_gx[0][0], k, k);
clock_gettime(CLOCK_REALTIME, &end);
printf("%f\n", time_elapsed(start, end));
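
For reference, the queue is created with profiling enabled along these lines; without CL_QUEUE_PROFILING_ENABLE, clGetEventProfilingInfo() would return CL_PROFILING_INFO_NOT_AVAILABLE. This is only a minimal sketch, with ctx and device standing in for the real variables:

/* create the command queue with profiling enabled (sketch) */
cl_command_queue queue = clCreateCommandQueue(ctx, device,
        CL_QUEUE_PROFILING_ENABLE, &err);
if (err != CL_SUCCESS)
    die("can not create command queue %d\n", err);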

The results are:

231.592960 16.701613 3.995006 151.874017
/* (device / host / reading-data / basic-c implementation )*/

What I don't understand is that the kernel execution time is actually greater than the CPU time measured via clock_gettime, even though, following [0], I use clWaitForEvents() to make sure the kernel is fully executed.

[0] : https://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-performance-debugging-intro

jnbd
  • Could you please clarify what exactly you are asking and edit your question title accordingly? – Kamiccolo Jun 03 '14 at 10:45
  • thx! i just fixed it. – jnbd Jun 03 '14 at 10:47
  • You call printf() 2 times. First you output 2 values, then 1 value. How do you get 4 values in stdout? Also, OpenCL profiling results are in nanoseconds, and you convert them into milliseconds. What is the precision of your clock_gettime? Are you comparing milliseconds with milliseconds? – Roman Arzumanyan Jun 03 '14 at 12:10
  • Are you using an OUT_OF_ORDER queue? The read might be done before the kernel finishes. – DarkZeros Jun 03 '14 at 12:20
  • @RomanArzumanyan yes, I slightly modified the output for readability. And yes, the comparison is in ms; I use `((end.tv_sec - start.tv_sec) * 1e3 + (end.tv_nsec - start.tv_nsec) * 1e-6)` (see the sketch after these comments). – jnbd Jun 03 '14 at 12:23
  • @DarkZeros no, I was actually unaware of this option. And I call *clWaitForEvents* _before_ the reading. – jnbd Jun 03 '14 at 12:25
  • Are you just trying to find out how long it takes? Do it in a loop 1000 times, see how many seconds it takes, and that's how many milliseconds one iteration takes. For microseconds, loop it a million times. And you don't need 9 digits of accuracy. 2 is plenty. If you want a rough idea why it takes that much time, [*just do this.*](http://stackoverflow.com/a/378024/23771) – Mike Dunlavey Jun 03 '14 at 14:29
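
For reference, a minimal sketch of what the time_elapsed() helper used above presumably looks like, reconstructed from the expression quoted in the comment (not the original implementation); it returns milliseconds:

/* elapsed wall-clock time between two timespecs, in milliseconds */
static double time_elapsed(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1e3
         + (end.tv_nsec - start.tv_nsec) * 1e-6;
}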

1 Answer

Please run this code and show the results.

static long Time_Elapsed(
    long start,
    long end)
{
    return end - start;
}

static long Get_CL_Time(
    cl_event event)
{
    cl_ulong start, end;

    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);     
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);

    return Time_Elapsed(start, end);
}

struct timespec start, end;
cl_event event_clock;

clock_gettime(CLOCK_REALTIME, &start);

/* run kernel */
err = clEnqueueNDRangeKernel(queue, img_conv_kernel, 2, NULL,
        &global_ws[0], &local_ws[0], 0, NULL, &event_clock);

clWaitForEvents(1, &event_clock);
long kernel_time = Get_CL_Time(event_clock);

/* read data */
err = clEnqueueReadBuffer(queue, res_d, CL_TRUE, 0, N*sizeof(float),
        res_h, 0, NULL, &event_clock);

clWaitForEvents(1, &event_clock);
long io_time = Get_CL_Time(event_clock);    

clock_gettime(CLOCK_REALTIME, &end);
/* include the seconds field so the difference survives a second boundary */
long host_time = Time_Elapsed(start.tv_sec * 1000000000L + start.tv_nsec,
                              end.tv_sec * 1000000000L + end.tv_nsec);

printf( "Kernel time: %ld nanoseconds \n"
        "IO time:     %ld nanoseconds \n"
        "Host time:   %ld nanoseconds \n", 

        kernel_time,
        io_time,
        host_time);

/* C implementation */
clock_gettime(CLOCK_REALTIME, &start);
conv(img_data, res_h, &sobel_gx[0][0], k, k);
clock_gettime(CLOCK_REALTIME, &end);
host_time = Time_Elapsed(start.tv_sec * 1000000000L + start.tv_nsec,
                         end.tv_sec * 1000000000L + end.tv_nsec);

printf("C implementation time: %ld nanoseconds\n", host_time);
Roman Arzumanyan
  • here it is sir : `Kernel time: 687377920 nanoseconds IO time: 53120 nanoseconds Host time: 53554859 nanoseconds C implementation time: 211760010 nanoseconds` – jnbd Jun 03 '14 at 13:57
  • What amount of data are you reading from the device? The kernel execution time & data reading time are way too different. – Roman Arzumanyan Jun 03 '14 at 14:07
  • Yes, that's what startled me too. An HD image in float, so 1080*1920*sizeof(float). – jnbd Jun 03 '14 at 14:11
  • So, in that case your read speed would be ((1920 * 1080 * 4) * 1e9 / 53120) / (1024 * 1024 * 1024) = 145 GB/s, which is unbelievably fast. Are you sure the kernel isn't producing an error? I mean not the return status of NDRange, as that's only the status of pushing into the queue. – Roman Arzumanyan Jun 03 '14 at 14:15
  • Just checked: I had a warning about a missing pragma (`#pragma OPENCL EXTENSION cl_khr_fp64 : enable`) but that's all (and it doesn't change a thing). Actually I use monochromatic images, so it would only be 1920*1080*1, but still... – jnbd Jun 03 '14 at 14:21
  • Check whether your C & OpenCL code produce pixel-wise identical output (see the comparison sketch after these comments). Conformance shall always go before performance. If the output isn't the same, more source code is required. The kernel execution time looks sane. Usually, if a kernel crashes, it produces an event with broken time counters. – Roman Arzumanyan Jun 03 '14 at 14:25
  • OK, yes, the output is different; I was not testing it with the latest version of the code. I'm going to take a look at it, thanks. Still, how do you explain that *kernel time* is greater than *host time*, given that the host has to wait for the kernel to finish... – jnbd Jun 03 '14 at 14:36
  • One of the reasons may be, as I said, an invalid kernel, which produces a wrong execution time. Another reason may be a bug in beignet. Make several runs of your kernel to check if the execution time stays close. If you provide source code, I may check it on my machine. – Roman Arzumanyan Jun 03 '14 at 14:43
  • ok, here you go : [c code](http://pastebin.com/qDnRiGEu) and [kernel file](http://pastebin.com/wXNcbc3k) – jnbd Jun 03 '14 at 14:52
  • I think the kernel is executing but segfaulting or something, which makes it run to the end on the GPU side (I don't think the kernel can be cancelled while it is running, due to the SIMD execution model), but control returns faster to the CPU side. – DarkZeros Jun 03 '14 at 15:39
  • I ran your code on my notebook (though it has an AMD GPU), and it shows perfectly sane results: Kernel time: 2508.445000, IO time: 6984.296000, Host time: 41000.000000. I modified your code (checked only timers and return codes); it can be found at http://pastebin.com/cnY3cBWc. I use my own OpenCL wrapper, which can be found at https://github.com/RomanArzumanyan/SCOW – Roman Arzumanyan Jun 03 '14 at 16:04
  • ok thanks a lot for your help ! i'll go for some bug in beignet then. – jnbd Jun 03 '14 at 16:56
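
As suggested in the comments, a minimal sketch of a pixel-wise comparison between the OpenCL result and the C reference. The buffer names, the size, and the tolerance are assumptions, not taken from the original code:

#include <stddef.h>
#include <math.h>

/* returns 1 if the two buffers match element-wise within eps, 0 otherwise */
static int buffers_match(const float *a, const float *b, size_t n, float eps)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (fabsf(a[i] - b[i]) > eps)
            return 0;
    return 1;
}

/* e.g. buffers_match(res_opencl, res_c, 1920 * 1080, 1e-4f) */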