I am trying to run some OpenCL kernels written for desktop graphics cards on an embedded GPU with fewer resources. In particular, the desktop version assumes that a work group size of at least 256 is always supported, but the Mali T628 ARM-based GPU only guarantees a work group size of at least 64.
Indeed, some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64, and I can't figure out why. I checked the CL_KERNEL_LOCAL_MEM_SIZE for the kernels in question and it is <2 KiB, whereas the CL_DEVICE_LOCAL_MEM_SIZE is 32 KiB, so I think I can rule out __local storage.
What other factors (e.g., registers or __private memory) contribute to a low CL_KERNEL_WORK_GROUP_SIZE, and how do I check their usage? I am open to both programmatic introspection (such as clGetKernelWorkGroupInfo(), which I have already used to some extent) and any development tools I may not know about.
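A minimal sketch of that kind of introspection, for reference. The struct, the enum, the function names and the HAVE_OPENCL macro are hypothetical and only serve the illustration; the clGetKernelWorkGroupInfo() parameter names are standard OpenCL 1.1. The guess_limit() heuristic just mirrors the reasoning above (local memory well under the device limit points at per-work-item resources) and is an assumption, not something the driver reports.

```c
#include <stddef.h>

/* Values reported by clGetKernelWorkGroupInfo() for one kernel/device pair.
 * This struct is a hypothetical container, not part of OpenCL. */
typedef struct {
    size_t work_group_size;               /* CL_KERNEL_WORK_GROUP_SIZE */
    size_t preferred_multiple;            /* CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE */
    unsigned long long local_mem_bytes;   /* CL_KERNEL_LOCAL_MEM_SIZE */
    unsigned long long private_mem_bytes; /* CL_KERNEL_PRIVATE_MEM_SIZE */
} kernel_limits;

typedef enum {
    HINT_OK,            /* kernel already supports the wanted work group size */
    HINT_LOCAL_MEM,     /* kernel's __local usage exceeds the device's local memory */
    HINT_PER_WORK_ITEM  /* __local fits, so registers/__private are the suspects */
} limit_hint;

/* Rough heuristic (an assumption): decide where to look first. */
limit_hint guess_limit(const kernel_limits *k, size_t wanted_wg_size,
                       unsigned long long device_local_mem_bytes)
{
    if (k->work_group_size >= wanted_wg_size)
        return HINT_OK;
    if (k->local_mem_bytes > device_local_mem_bytes)
        return HINT_LOCAL_MEM;
    return HINT_PER_WORK_ITEM;
}

#ifdef HAVE_OPENCL /* hypothetical macro: define it when OpenCL headers are available */
#include <CL/cl.h>

/* Fill *out for an already built kernel. Returns 0 on success, -1 on error. */
int query_kernel_limits(cl_kernel kernel, cl_device_id device, kernel_limits *out)
{
    cl_ulong local_mem = 0, private_mem = 0;
    cl_int err;

    err = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                   sizeof(out->work_group_size),
                                   &out->work_group_size, NULL);
    if (err == CL_SUCCESS)
        err = clGetKernelWorkGroupInfo(kernel, device,
                                       CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                       sizeof(out->preferred_multiple),
                                       &out->preferred_multiple, NULL);
    if (err == CL_SUCCESS)
        err = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                                       sizeof(local_mem), &local_mem, NULL);
    if (err == CL_SUCCESS)
        err = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                                       sizeof(private_mem), &private_mem, NULL);
    if (err != CL_SUCCESS)
        return -1;

    out->local_mem_bytes = local_mem;
    out->private_mem_bytes = private_mem;
    return 0;
}
#endif
```

With the numbers from this question (work group size 64, under 2 KiB of __local, 32 KiB on the device, 256 wanted), guess_limit() returns HINT_PER_WORK_ITEM. Whether CL_KERNEL_PRIVATE_MEM_SIZE says anything useful about register pressure depends on the driver.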
EDIT:
The kernels are part of the OpenCL module of OpenCV v2.4, in particular the kernel icvCalcOrientation in surf.cl. The code is fairly complex, and several compile-time parameters are set, so it is impractical to analyze the kernel by hand without some hint of what to look for.
If there is a way to troubleshoot this on NVidia or AMD hardware (which I have access to), I am open to it.