Time required for OpenCL kernel deletion

Question

I'm seeing unexpected performance with my OpenCL code (more precisely, I use boost::compute 1.67.0). For now, I just want to add the elements of two buffers: c[i] = a[i] + b[i]. I noticed it was slower than an existing SIMD implementation, so I timed each step separately to find out which one is time-consuming. Here is my code sample:

    Chrono chrono2;
    chrono2.start();
    Chrono chrono;
    ipReal64 elapsed;

    // creating the OpenCL context and other stuff
    // ...
    
    std::string kernel_src = BOOST_COMPUTE_STRINGIZE_SOURCE(
        __kernel void add_knl(__global const uchar* in1, __global const uchar* in2, __global uchar* out)
        {
            size_t idx = get_global_id(0);
            out[idx] = in1[idx] + in2[idx];
        }
    );

    boost::compute::program* program = new boost::compute::program;
    try {
        chrono.start();
        *program = boost::compute::program::create_with_source(kernel_src, context);
        elapsed = chrono.elapsed();
        std::cout << "Create program : " << elapsed << "s" << std::endl;
        chrono.start();
        program->build();
        elapsed = chrono.elapsed();
        std::cout << "Build program : " << elapsed << "s" << std::endl;
    }
    catch (boost::compute::opencl_error& e) {
        std::cout << "Error building program : " << std::endl << program->build_log() << std::endl << e.what() << std::endl;
        return;
    }

    boost::compute::kernel* kernel = new boost::compute::kernel;
    try {
        chrono.start();
        *kernel = program->create_kernel("add_knl");
        elapsed = chrono.elapsed();
        std::cout << "Create kernel : " << elapsed << "s" << std::endl;
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error creating kernel : " << std::endl << e.what() << std::endl;
        return;
    }

    try {
        chrono.start();
        // Pass the argument to the kernel
        kernel->set_arg(0, bufIn1);
        kernel->set_arg(1, bufIn2);
        kernel->set_arg(2, bufOut);
        elapsed = chrono.elapsed();
        std::cout << "Set args : " << elapsed << "s" << std::endl;
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error setting kernel arguments: " << std::endl << e.what() << std::endl;
        return;
    }

    try {

        chrono.start();
        queue.enqueue_1d_range_kernel(*kernel, 0, sizeX*sizeY, 0);
        elapsed = chrono.elapsed();
        std::cout << "Kernel calculation : " << elapsed << "s" << std::endl;
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error executing kernel : " << std::endl << e.what() << std::endl;
        return;
    }
    
    std::cout << "[Function] Full duration " << chrono2.elapsed() << std::endl;

    chrono.start();
    delete program;
    elapsed = chrono.elapsed();
    std::cout << "Delete program : " << elapsed << "s" << std::endl;

    chrono.start();
    delete kernel;
    elapsed = chrono.elapsed();
    std::cout << "Delete kernel  : " << elapsed << "s" << std::endl;

And here is a sample of the results (I run my program on an NVIDIA GeForce GT 630, with the NVIDIA SDK Toolkit):

    Create program           : 0.0013123s
    Build program            : 0.0015421s
    Create kernel            : 6.6e-06s
    Set args                 : 1.7e-06s
    Kernel calculation       : 0.0001639s
    [Function] Full duration : 0.0077794
    Delete program           : 4.1e-06s
    Delete kernel            : 0.0879901s

I know my program is simple, and I don't expect the kernel execution to be the most time-consuming step. However, I thought the kernel deletion would take only a few ms, like creating or building the program.

Is this normal behaviour?

Thanks

score 1 · Accepted Answer · answered Nov 20 '20 at 15:31

I'll point out that I've never used boost::compute, but it looks like it's a fairly thin wrapper over OpenCL, so the following should be correct:

Enqueueing the kernel does not wait for it to complete. The enqueue function returns an event, which you can then wait for, or you can wait for all tasks enqueued onto the queue to complete. You are timing neither of those things. What is likely happening is that when you destroy your kernel, it waits for all queued instances which are still pending to complete before returning from the destructor.
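
To make that concrete, here is a minimal, standard-library-only sketch of the timing pattern. It is not boost::compute code: `std::async` stands in for the kernel launch and the returned `std::future` stands in for the OpenCL event, and the names `Timings`, `time_submit_and_wait` and `fake_enqueue` are made up for illustration. The point is only that submitting the work and completing it are two separate intervals, and a timer stopped right after submission measures just the first one.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Seconds spent submitting the work and, separately, waiting for it to finish.
struct Timings {
    double submit_s;
    double wait_s;
};

// `enqueue` is any callable that returns a handle with a wait() member.
template <class Enqueue>
Timings time_submit_and_wait(Enqueue enqueue) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    auto handle = enqueue();  // returns quickly; the work runs asynchronously
    const auto t1 = clock::now();
    handle.wait();            // blocks until the work has actually completed
    const auto t2 = clock::now();
    return { std::chrono::duration<double>(t1 - t0).count(),
             std::chrono::duration<double>(t2 - t1).count() };
}

// Hypothetical stand-in for a kernel launch: 100 ms of "work" on another thread.
inline std::future<void> fake_enqueue() {
    return std::async(std::launch::async, [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
}
```

Calling `time_submit_and_wait(fake_enqueue)` should report a tiny `submit_s` and a `wait_s` of roughly 0.1 s. Since the helper only needs a handle with a `wait()` member, I'd expect it to work with a lambda that returns the event from `queue.enqueue_1d_range_kernel(...)`, though I haven't tried that. As a side note, a `std::future` obtained from `std::async` also blocks in its destructor if nobody waited on it, which is the same kind of behaviour I suspect for the kernel object here.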

pmdj


Thanks a lot! It was a stupid mistake on my part... When I wait for the event to complete, the kernel calculation takes around 0.08-0.09 seconds. What bothers me is that my plain CPU implementation takes only 0.04 seconds (I apply the kernel to a 15000x15000 image). Are any optimizations possible for a simple kernel like this one? I found https://stackoverflow.com/questions/56144284/troubles-with-slow-speeds-in-opencl; I will study it and see. – ractiv Nov 20 '20 at 17:36
Adding 2 vectors is going to be hard to further speed up, as it's likely to be memory bandwidth bound. You need to perform more computation in your kernel to hide the limitations of memory access. – pmdj Nov 21 '20 at 11:33
