My application takes 5200 ms to process a data set using OpenCL on the GPU and 330 ms for the same data using OpenCL on the CPU, while the same processing done without OpenCL on the CPU using multiple threads takes 110 ms. The OpenCL timing covers only kernel execution, i.e. it starts just before clEnqueueNDRangeKernel and ends just after clFinish.
I have a Windows gadget which tells me that I am using only 19% of the GPU. Even if I could get that to 100%, it would still take ~1000 ms, which is much slower than my CPU.
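The ~1000 ms figure follows from scaling the measured time by the reported utilization. A minimal C sketch of that arithmetic is below; it assumes run time scales inversely with utilization, which is only a rough guess, and the function name estimate_full_util_ms is hypothetical.

```c
#include <assert.h>

/* Estimate the run time at 100% utilization, ASSUMING run time scales
   inversely with the utilization reported by the gadget (a rough model). */
double estimate_full_util_ms(double measured_ms, double utilization)
{
    assert(utilization > 0.0 && utilization <= 1.0);
    return measured_ms * utilization;
}
```

With the figures above, estimate_full_util_ms(5200.0, 0.19) gives roughly 988 ms, still about nine times the 110 ms of the multithreaded CPU version.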
The work group size is a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and I am using all compute units (6 for the GPU and 4 for the CPU). Here is my kernel:
__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
    size_t gid = get_global_id(0);
    myreal pCoef = coef[gid];
    myreal pRow = row[gid];
    pCoef = pCoef - (pRow * ratio);
    coef[gid] = pCoef;
}
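For comparison, a serial C sketch of the same per-element update is below. It assumes myreal is a typedef for float (the actual definition is not shown here), and the function name reduce_u_ratios_ref is hypothetical; it is meant only as a reference for checking the kernel output, not as the multithreaded CPU version.

```c
#include <assert.h>
#include <stddef.h>

typedef float myreal; /* assumption: the real typedef is not shown */

/* Serial reference for reduceURatios: coef[i] = coef[i] - row[i] * ratio.
   The loop index i plays the role of get_global_id(0) in the kernel. */
void reduce_u_ratios_ref(myreal *coef, const myreal *row, myreal ratio, size_t n)
{
    assert(n == 0 || (coef != NULL && row != NULL));
    for (size_t i = 0; i < n; ++i) {
        coef[i] = coef[i] - (row[i] * ratio);
    }
}
```

Each element needs two loads and one store for a single multiply and subtract.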
I am getting similarly poor performance with another kernel:
__kernel void calcURatios(__global myreal *ratios, __global myreal *rhs, myreal c, myreal r)
{
    size_t gid = get_global_id(0);
    myreal pRatios = ratios[gid];
    myreal pRHS = rhs[gid];
    pRatios = pRatios / c;
    ratios[gid] = pRatios;
    //pRatios = pRatios * r;
    pRHS = pRHS - (pRatios * r);
    rhs[gid] = pRHS;
}
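A serial C sketch of what this second kernel computes is below, again assuming myreal is a typedef for float (not shown here) and using a hypothetical function name, calc_u_ratios_ref. It can serve as a reference when validating the OpenCL results.

```c
#include <assert.h>
#include <stddef.h>

typedef float myreal; /* assumption: the real typedef is not shown */

/* Serial reference for calcURatios:
   ratios[i] = ratios[i] / c, then rhs[i] = rhs[i] - ratios[i] * r,
   where the second step uses the already-divided ratio. */
void calc_u_ratios_ref(myreal *ratios, myreal *rhs, myreal c, myreal r, size_t n)
{
    assert(n == 0 || (ratios != NULL && rhs != NULL));
    for (size_t i = 0; i < n; ++i) {
        myreal scaled = ratios[i] / c;
        ratios[i] = scaled;
        rhs[i] = rhs[i] - (scaled * r);
    }
}
```

Here each element needs two loads and two stores for one divide, one multiply and one subtract.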
Questions:
- Why does my GPU perform so poorly compared to the CPU under OpenCL?
- Why is the CPU under OpenCL 3x slower than the CPU without OpenCL but with multiple threads?