I'm seeing unexpected performance with my OpenCL code (more precisely, I use boost::compute 1.67.0). For now, I just want to add the elements of two buffers: c[i] = a[i] + b[i]. I noticed that it is slower than an existing SIMD implementation, so I timed each step separately to find out which one is time-consuming. Here is my code sample:
Chrono chrono2;
chrono2.start();
Chrono chrono;
ipReal64 elapsed;
// creating the OpenCL context and other stuff
// ...
std::string kernel_src = BOOST_COMPUTE_STRINGIZE_SOURCE(
    __kernel void add_knl(__global const uchar* in1, __global const uchar* in2, __global uchar* out)
    {
        size_t idx = get_global_id(0);
        out[idx] = in1[idx] + in2[idx];
    }
);
boost::compute::program* program = new boost::compute::program;
try {
    chrono.start();
    *program = boost::compute::program::create_with_source(kernel_src, context);
    elapsed = chrono.elapsed();
    std::cout << "Create program : " << elapsed << "s" << std::endl;
    chrono.start();
    program->build();
    elapsed = chrono.elapsed();
    std::cout << "Build program : " << elapsed << "s" << std::endl;
}
catch (boost::compute::opencl_error& e) {
    std::cout << "Error building program : " << std::endl << program->build_log() << std::endl << e.what() << std::endl;
    return;
}
boost::compute::kernel* kernel = new boost::compute::kernel;
try {
    chrono.start();
    *kernel = program->create_kernel("add_knl");
    elapsed = chrono.elapsed();
    std::cout << "Create kernel : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
    std::cout << "Error creating kernel : " << std::endl << e.what() << std::endl;
    return;
}
try {
    chrono.start();
    // Pass the arguments to the kernel
    kernel->set_arg(0, bufIn1);
    kernel->set_arg(1, bufIn2);
    kernel->set_arg(2, bufOut);
    elapsed = chrono.elapsed();
    std::cout << "Set args : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
    std::cout << "Error setting kernel arguments: " << std::endl << e.what() << std::endl;
    return;
}
try {
    chrono.start();
    queue.enqueue_1d_range_kernel(*kernel, 0, sizeX*sizeY, 0);
    elapsed = chrono.elapsed();
    std::cout << "Kernel calculation : " << elapsed << "s" << std::endl;
}
catch (const boost::compute::opencl_error& e) {
    std::cout << "Error executing kernel : " << std::endl << e.what() << std::endl;
    return;
}
std::cout << "[Function] Full duration : " << chrono2.elapsed() << std::endl;
chrono.start();
delete program;
elapsed = chrono.elapsed();
std::cout << "Delete program : " << elapsed << "s" << std::endl;
chrono.start(); // restart so the kernel deletion is timed on its own
delete kernel;
elapsed = chrono.elapsed();
std::cout << "Delete kernel : " << elapsed << "s" << std::endl;
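The Chrono helper is not shown in the snippet. For anyone who wants to reproduce the timings, here is a minimal stand-in, assuming it simply wraps std::chrono::steady_clock and that elapsed() returns seconds since the last start(). This is an assumption, not the actual class, and time_step and time_placeholder_sleep are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the Chrono helper (the real class is not shown).
class Chrono {
public:
    // Record the current time as the start of the measurement.
    void start() { begin_ = std::chrono::steady_clock::now(); }

    // Seconds elapsed since the last call to start().
    double elapsed() const {
        const std::chrono::duration<double> diff =
            std::chrono::steady_clock::now() - begin_;
        return diff.count();
    }

private:
    std::chrono::steady_clock::time_point begin_ = std::chrono::steady_clock::now();
};

// Hypothetical helper: restart the timer, run one step, return its duration.
template <typename Step>
double time_step(Step&& step) {
    Chrono chrono;
    chrono.start();
    step();
    return chrono.elapsed();
}

// Example: time a 20 ms sleep used as a placeholder workload.
inline double time_placeholder_sleep() {
    const double seconds = time_step([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
    });
    assert(seconds >= 0.0);
    return seconds;
}
```

With such a helper, a step could be timed as time_step([&] { delete kernel; }). Note that this only measures how long the host-side call takes to return.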
And here is a sample of the output (I run my program on an NVIDIA GeForce GT 630, with the NVIDIA SDK Toolkit):
Create program : 0.0013123s
Build program : 0.0015421s
Create kernel : 6.6e-06s
Set args : 1.7e-06s
Kernel calculation : 0.0001639s
[Function] Full duration : 0.0077794
Delete program : 4.1e-06s
Delete kernel : 0.0879901s
I know my program is simple, and I don't expect the kernel execution to be the most time-consuming step. However, I thought deleting the kernel would take only a few ms, like creating or building the program.
Is this normal behaviour?
Thanks