
I'm looking into OpenCL, and I'm a little confused about why this kernel runs so much more slowly than I would expect. Here's the kernel:

__kernel void copy(
  const __global char* pSrc, 
  __global __write_only char* pDst, 
  int length)
{
  const int tid = get_global_id(0);

  if(tid < length) {
    pDst[tid] = pSrc[tid];
  }
}

I've created the buffers in the following way:

const int length = 2048 * 2048;
char* out = new char[length];
cl::Buffer outBuffer(
  context,
  CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
  length,
  out);

Ditto for the input buffer, except that I've initialized the in pointer to random values. Finally, I run the kernel this way:
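For reference, here is a minimal sketch of the input side (the names in and inBuffer are illustrative, not the original code):

char* in = new char[length];
for(int i = 0; i < length; ++i) {
  in[i] = static_cast<char>(rand());  // fill with random values, as described above
}
cl::Buffer inBuffer(
  context,
  CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
  length,
  in);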

cl::Event event;
queue.enqueueNDRangeKernel(
  kernel, 
  cl::NullRange,
  cl::NDRange(length),
  cl::NDRange(1), 
  NULL, 
  &event);

event.wait();

On average, the time is around 75 milliseconds, as calculated by:

// Assumed definitions (not shown in the original post); note the queue
// must be created with CL_QUEUE_PROFILING_ENABLE for this to work.
const double SECONDS_PER_NANO = 1.0e-9;
const double SECONDS_PER_MILLI = 1.0e-3;

cl_ulong startTime = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong endTime = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << (endTime - startTime) * SECONDS_PER_NANO / SECONDS_PER_MILLI << "\n";

I'm running Windows 7, with an Intel i5-3450 chip (Ivy Bridge architecture). For comparison, the "direct" way of doing the copy takes less than 5 milliseconds. I don't think event.getProfilingInfo includes the communication time between the host and device. Thoughts?
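To separate kernel time from host-side overhead, the whole enqueue/wait sequence can also be timed with a wall-clock timer; here is a sketch using QueryPerformanceCounter (my illustration, requires <windows.h>):

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);

queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           cl::NDRange(length), cl::NDRange(1),
                           NULL, &event);
event.wait();

QueryPerformanceCounter(&t1);
// Host-side wall clock in milliseconds, including queueing overhead.
std::cout << 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart << "\n";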

EDIT:

At the suggestion of ananthonline, I changed the kernel to use float4s instead of chars, and that dropped the average run time to about 50 millis. Still not as fast as I would have hoped, but an improvement. Thanks ananthonline!
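For reference, a sketch of what the float4 variant might look like (a reconstruction, not the exact kernel used; it assumes the byte length is divisible by 16, with length4 being the number of float4 elements):

__kernel void copy4(
  const __global float4* pSrc, 
  __global __write_only float4* pDst, 
  int length4)  // number of float4 elements, i.e. bytes / 16
{
  const int tid = get_global_id(0);

  if(tid < length4) {
    pDst[tid] = pSrc[tid];
  }
}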

  • What implementation is this? The Intel OpenCL implementation? Have you tried the same kernel with arrays of 4-floats? That might be a better memory access pattern. – Ani Oct 11 '12 at 19:16
  • Yeah, the Intel OpenCL implementation. I haven't tried 4-floats; that's a good thought. I'll look into it. –  Oct 11 '12 at 19:17
  • Out of curiosity, how does `clEnqueueCopyBuffer` perform? Is that what you mean by "direct"? – willglynn Oct 11 '12 at 19:40
  • @willglynn Much, much better. The profiling information reports the time as around 2 milliseconds. Wrapping the call to event.wait() in QueryPerformanceCounter reports the time a little higher, around 8 milliseconds, but of course that includes the overhead of the wait call itself. (A sketch of this baseline appears after the comments.) –  Oct 11 '12 at 19:46
  • @willglynn But what I meant by "direct" was a memcpy call. That's a little apples-to-oranges, I know, but it at least provides some sort of baseline. –  Oct 11 '12 at 19:47
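
For anyone reproducing the clEnqueueCopyBuffer baseline from these comments, a sketch via the C++ wrapper (the buffer names are illustrative):

cl::Event copyEvent;
queue.enqueueCopyBuffer(
  inBuffer,    // source
  outBuffer,   // destination
  0,           // source offset in bytes
  0,           // destination offset in bytes
  length,      // size in bytes
  NULL,
  &copyEvent);
copyEvent.wait();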

3 Answers


I think your main problem is the 2048*2048 single-item work groups you are using. The OpenCL driver on your system has to manage a lot more overhead when you have this many single-item work groups. This would be especially bad if you were to execute this program on a GPU, because you would get very low saturation of the hardware.

Optimization: call your kernel with larger work groups. You don't even have to change your existing kernel. How large should the group be? I have used 64 below as an example; 64 happens to be a decent number on most hardware.

const size_t myOptimalGroupSize = 64;
cl::Event event;
queue.enqueueNDRangeKernel(
  kernel, 
  cl::NullRange,
  cl::NDRange(length),
  cl::NDRange(myOptimalGroupSize), 
  NULL, 
  &event);

event.wait();

You should also get your kernel to do more than copy a single value per work-item. I have given an answer to a similar question about global memory; a sketch of the idea is below.
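For example (my illustration, not the code from that answer), each work-item can copy several elements by striding by the total global size, so a much smaller global size suffices:

__kernel void copy_strided(
  const __global char* pSrc, 
  __global __write_only char* pDst, 
  int length)
{
  // Each work-item copies several elements, striding by the total
  // number of work-items, so fewer and larger work groups are needed.
  const int stride = get_global_size(0);
  for(int i = get_global_id(0); i < length; i += stride) {
    pDst[i] = pSrc[i];
  }
}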

– mfa

CPUs are very different from GPUs. When running on an x86 CPU, the best way to achieve decent performance is to use double16 (the largest data type) instead of char or float4 (as suggested by someone else).

In my limited experience with OpenCL on CPUs, I have never reached the performance levels I could get with an OpenMP parallelization. The best way to do a copy in parallel on a CPU is to divide the block to copy into a small number of large sub-blocks, and let each thread copy one sub-block. The GPU approach is orthogonal: each thread participates in the copy of the same block. This is because on GPUs, different threads can access contiguous memory regions efficiently (coalescing).

To do an efficient copy on a CPU with OpenCL, use a loop inside your kernel to copy contiguous data, and use a work-group size no larger than the number of available cores. A sketch of that approach follows.
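Here is what that might look like (the blockSize parameter and the assumption that it divides the length evenly are mine):

__kernel void copy_blocks(
  const __global char* pSrc, 
  __global __write_only char* pDst, 
  int blockSize)  // length / number-of-work-items; assumes an even split
{
  // Each work-item copies one contiguous sub-block, which suits CPU
  // caches; launch with a global size of roughly the core count.
  const int start = get_global_id(0) * blockSize;
  for(int i = start; i < start + blockSize; ++i) {
    pDst[i] = pSrc[i];
  }
}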

– nat chouf

I believe it is the cl::NDRange(1) which is telling the runtime to use single-item work groups. This is not efficient. In the C API you can pass NULL for the local work size to leave the work-group size up to the runtime; in the C++ API the equivalent is to pass cl::NullRange. This should be faster on the CPU; it certainly will be on a GPU.
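In the C++ bindings, that change to the call from the question would look like this sketch:

queue.enqueueNDRangeKernel(
  kernel, 
  cl::NullRange,          // offset
  cl::NDRange(length),    // global size
  cl::NullRange,          // local size: chosen by the runtime
  NULL, 
  &event);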

– Dithermaster