
My application takes 5200 ms to process a data set with OpenCL on the GPU and 330 ms for the same data with OpenCL on the CPU, while the same processing done without OpenCL, on the CPU with multiple threads, takes 110 ms. The OpenCL timing covers only the kernel execution, i.e. it starts just before clEnqueueNDRangeKernel and ends just after clFinish. A Windows gadget tells me that I am only using 19% of the GPU; even if I could push that to 100%, it would still take roughly 1000 ms, which is much slower than my CPU.
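
For reference, the timing is taken roughly like this (a sketch; the helper name and the surrounding OpenCL setup are placeholders rather than my exact code):

#include <windows.h>
#include <CL/cl.h>

/* Wall-clock timing of a single launch: the clock starts just before
   clEnqueueNDRangeKernel and stops just after clFinish, so only the
   kernel execution (plus queue overhead) is measured. */
static double timeKernelMs(cl_command_queue queue, cl_kernel kernel,
                           size_t globalSize, size_t localSize)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalSize, &localSize, 0, NULL, NULL);
    clFinish(queue);    /* block until the kernel has actually finished */
    QueryPerformanceCounter(&t1);

    return 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}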


The work group size is a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and I am using all compute units (6 on the GPU and 4 on the CPU); how I obtain these values is sketched after the kernel below. Here is my kernel:

__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
    size_t gid = get_global_id(0);

    myreal pCoef = coef[gid];
    myreal pRow = row[gid];

    pCoef = pCoef - (pRow * ratio);
    coef[gid] = pCoef;
}
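
For completeness, the preferred multiple and the compute-unit count mentioned above are obtained roughly like this (a sketch; the helper name is only illustrative):

#include <stdio.h>
#include <CL/cl.h>

/* Query the hints used to pick the work group size; error checking omitted. */
static void printWorkGroupHints(cl_kernel kernel, cl_device_id device)
{
    size_t preferredMultiple = 0;
    cl_uint computeUnits = 0;

    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferredMultiple), &preferredMultiple, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(computeUnits), &computeUnits, NULL);

    printf("preferred work group size multiple: %u, compute units: %u\n",
           (unsigned int)preferredMultiple, (unsigned int)computeUnits);
}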

I am getting similarly poor performance from another kernel:

__kernel void calcURatios(__global myreal *ratios, __global myreal *rhs, myreal c, myreal r)
{
    size_t gid = get_global_id(0);

    myreal pRatios = ratios[gid];
    myreal pRHS = rhs[gid];

    pRatios = pRatios / c;
    ratios[gid] = pRatios;

    //pRatios = pRatios * r;
    pRHS = pRHS - (pRatios * r);
    rhs[gid] = pRHS;
}

Questions:

  1. Why is my GPU performing so poorly compared to the CPU with OpenCL?
  2. Why is the CPU with OpenCL 3x slower than the multithreaded CPU code without OpenCL?

1 Answer


Maybe you could add some information about how you enqueue this kernel; perhaps with an inappropriate local work size? (If in doubt, just pass NULL as the local work size and OpenCL will choose an appropriate one.)
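
For example, something along these lines (a sketch; 'queue', 'kernel' and the element count come from your existing setup):

#include <CL/cl.h>

/* Enqueue the kernel and let the implementation pick the local work size. */
static void launchWithAutoLocalSize(cl_command_queue queue, cl_kernel kernel, size_t n)
{
    size_t globalWorkSize = n;   /* one work item per element */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalWorkSize,
                           NULL,             /* NULL: OpenCL chooses the local size */
                           0, NULL, NULL);
    clFinish(queue);
}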

But even in the best case, it's unlikely that you will see a speedup here. The computation that you are doing is heavily memory-bound. In the first kernel, you read two elements from global memory, perform a trivial subtraction/multiplication, and then write one element back to global memory (and the second kernel is not much different). Assuming myreal is a 4-byte float, that is 12 bytes of global memory traffic for only two floating-point operations per work item. The bottleneck is simply not the computation, but the data transfer.

(BTW: Recently, I wrote a few general words about that in https://stackoverflow.com/a/22868938 ).

Maybe the new developments around Unified Memory, HSA, AMD Kaveri etc. will come to the rescue here, but these are still at an early stage.

EDIT: Maybe you could also describe the context in which you are performing these computations. If there are further kernels that work on the results of this kernel, maybe they could be combined in order to improve the memory/computation ratio.
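
As a purely hypothetical illustration (I don't know what your pipeline actually looks like): if a later kernel immediately re-reads coef, for example to scale it, then fusing the two steps into one kernel saves a full write and re-read of coef through global memory:

__kernel void reduceAndScaleURatios(__global myreal *coef,
                                    __global myreal *row,
                                    myreal ratio,
                                    myreal scale)
{
    size_t gid = get_global_id(0);

    myreal pCoef = coef[gid] - (row[gid] * ratio);  // first step, as in reduceURatios
    coef[gid] = pCoef * scale;                      // hypothetical second step, fused in
}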

  • Hi Marco, thanks for your points. You mentioned that I am doing a trivial computation and the bottleneck is the data transfer. But my requirement is only this much computation. So my application is not suitable for OpenCL then? – Cool_Coder Apr 09 '14 at 16:08
  • @Cool_Coder This was a very general remark: it's unlikely that you will achieve a good speedup for such a simple computation. Additionally, you mentioned that you did not yet include the memory transfers in the timing, so it would become even worse. Maybe someone else knows something that I do *not* know and can give you hints that somehow make it possible to achieve a speedup in this particular case, but I have never seen a speedup from using the GPU for such a task. – Marco13 Apr 09 '14 at 16:14
  • I have added another question which details how I am choosing the local and global workgroup size. Can you please provide your valuable suggestion? http://stackoverflow.com/questions/22968369/get-optimum-local-global-workgroup-size-in-opencl – Cool_Coder Apr 09 '14 at 16:20
  • I am using OpenCL only for this single kernel, as the computation done by that kernel was previously a bottleneck in my multithreaded version of the application. So I decided to perform the computation on the GPU, but it turns out the GPU is not of help for the task at hand...:((( – Cool_Coder Apr 09 '14 at 16:24
  • Don't give up yet. Maybe someone else here has a brilliant idea. – Marco13 Apr 09 '14 at 16:27
  • As Marco13 mentioned, you are limited by memory transfers; however, I don't expect the speed to be this low. What are your work group sizes? If you don't know them, simply use NULL. Trying to guess them is typically a bad idea and can even lead to 10x and 20x speed decreases if you do it wrong. – DarkZeros Apr 09 '14 at 16:43
  • @Marco13 I wish I had seen that answer a couple of months ago, then I would not have wasted 2 months implementing the code. Anyways now I understand that not every data parallel bottleneck can be solved by OpenCL.... :((( – Cool_Coder Apr 10 '14 at 15:12
  • @Cool_Coder: As mentioned above (and in http://stackoverflow.com/questions/22968369/get-optimum-local-global-workgroup-size-in-opencl/22969485#22969485 ) you should probably not yet give up. Maybe your problem can be re-formulated so that it benefits from the GPU (this is not really a "profound" statement, because I don't know exactly what you want to compute - but it would be a pity if you had wasted 2 months...) – Marco13 Apr 11 '14 at 22:34