
The OpenCL standard defines the following queries for getting information about a device and a compiled kernel:

  • CL_DEVICE_MAX_COMPUTE_UNITS

  • CL_DEVICE_MAX_WORK_GROUP_SIZE

  • CL_KERNEL_WORK_GROUP_SIZE

  • CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

Given these values, how can I calculate the optimal work group size and the number of work groups?
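For reference, here is a minimal sketch of how these values can be queried (error handling omitted; the `device` and `kernel` handles are assumed to exist already):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Sketch: query the four values above for an existing device and a
       kernel compiled for it. Error checking is omitted for brevity. */
    void print_tuning_info(cl_device_id device, cl_kernel kernel)
    {
        cl_uint compute_units;
        size_t  device_max_wg, kernel_wg, preferred_multiple;

        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(device_max_wg), &device_max_wg, NULL);
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(kernel_wg), &kernel_wg, NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred_multiple), &preferred_multiple, NULL);

        printf("compute units:           %u\n",  compute_units);
        printf("device max work group:   %zu\n", device_max_wg);
        printf("kernel max work group:   %zu\n", kernel_wg);
        printf("preferred size multiple: %zu\n", preferred_multiple);
    }

(Note that CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE requires OpenCL 1.1 or later.)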

Kentzo

2 Answers


You discover these values experimentally for your algorithm. Use a profiler to get hard numbers.

I like to use CL_DEVICE_MAX_COMPUTE_UNITS as the number of work groups, because I often rely on synchronizing work items. I usually run kernels with little branching, so they take the same time to execute in each compute unit.
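As a sketch, that choice looks like this at enqueue time (illustrative fragment; `queue` and `kernel` are assumed to exist, `compute_units` comes from a CL_DEVICE_MAX_COMPUTE_UNITS query, and the local size of 64 is only a placeholder):

    /* One work group per compute unit: global size = groups * local size. */
    size_t local  = 64;                          /* placeholder; tune per device */
    size_t global = (size_t)compute_units * local;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, NULL);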

Some multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will be optimal for your device. What that multiple actually is depends on your memory access pattern and type of work you are doing with each work item. Use 1 as the multiple when you are running a heavy, compute-bound (ALU) kernel. Try a larger multiple to hide memory latency if you are bottlenecked by memory access. Use a profiler to determine when your access time and your ALU time are optimal.
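A host-side sketch of that sizing rule (illustrative names; `k` is whatever multiple your profiling suggests, and the kernel's maximum is assumed to be at least one multiple):

    /* Sketch: choose the local size as a multiple of the preferred value,
       clamped to the kernel's maximum work group size. */
    size_t pick_local_size(size_t preferred_multiple, size_t kernel_wg, size_t k)
    {
        size_t local = k * preferred_multiple;   /* k = 1 for ALU-bound kernels */
        if (local > kernel_wg)                   /* clamp to the kernel's limit */
            local = (kernel_wg / preferred_multiple) * preferred_multiple;
        return local;
    }

If the global size is then rounded up to a multiple of this local size, remember to add a bounds check inside the kernel for the padded work items.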

The optimal ALU:fetch ratio is 1:1 for any device. This is rarely achieved in practice, so you want to keep the ALU/SIMD banks saturated; that is, ALU:fetch should be greater than 1 whenever possible. A ratio below 1 means you should try a larger work group size to better hide the memory latency.

mfa
  • I'm targeting a range of devices. Does this mean I have to test my kernels on each of them to get optimal values for kernel enqueuing? – Kentzo Apr 11 '12 at 05:08
  • Test out your algorithm on the devices you have access to -- the results shouldn't vary too much. I suggest trying it on one device from each major architecture you want to target. If you are able to, adjust the params at runtime to try optimizing. This could tweak the optimal values you discovered during development. Getting feedback from the end user / client about actual hardware numbers will let you focus improvements on the most common devices. – mfa Apr 11 '12 at 11:07
  • In general, using `CL_DEVICE_MAX_COMPUTE_UNITS` won't give you optimal performance (unless maybe you do a lot of synchronization between work groups, but that is generally a bad idea anyway). I would consult the documentation for good values, but I've never seen more work groups hurt performance, so the more the merrier. Note that the part about choosing larger work group sizes to hide memory latency is (at least for GPUs) only true if you don't use enough work groups (like CL_DEVICE_MAX_COMPUTE_UNITS, since CUs can typically sustain more than one work group at a time). – Grizzly Apr 12 '12 at 13:43
  • @Grizzly I know that using CL_DEVICE_MAX_COMPUTE_UNITS as the number of work groups is a bad idea. I use it as a multiplier, e.g. 10 * CL_DEVICE_MAX_COMPUTE_UNITS. I'm still interested in runtime-based methods to determine the preferred work group size and count, because I typically have to enqueue dozens of subtasks within one main task. – Kentzo Apr 15 '12 at 08:15

As mfa said, you have to discover these values experimentally. I wanted to add that, depending on what you are computing (particularly the size of each work item's job), two configurations are often worth trying:

  • Lots of work items with small work groups and each job item being small.
  • Fewer work items with larger work groups and each job item being larger.

That is, check the extreme cases first and see how each affects the processing pipeline (a kernel sketch of the second variant follows below).
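A kernel-side sketch of the second variant (illustrative OpenCL C; each work item processes `per_item` elements instead of one, so fewer, larger work items cover the same data):

    /* Sketch: each work item handles per_item consecutive elements. */
    __kernel void scale(__global float *data, uint n, uint per_item)
    {
        uint base = (uint)get_global_id(0) * per_item;
        for (uint i = 0; i < per_item; ++i) {
            uint idx = base + i;
            if (idx < n)        /* bounds check for the padded range */
                data[idx] *= 2.0f;
        }
    }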

In essence, you have to tweak it. I often execute the kernel several times with different parameters, profile each run, and then generate a surface plot to see how it behaves.
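A sketch of that kind of sweep (illustrative; timing uses OpenCL profiling events, so the queue must have been created with CL_QUEUE_PROFILING_ENABLE, and in real code the local size should also be clamped to the kernel's maximum):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Sketch: time the kernel for several local sizes and print
       (local size, milliseconds) pairs to feed into a surface plot. */
    void sweep_local_sizes(cl_command_queue queue, cl_kernel kernel,
                           size_t work_items, size_t preferred_multiple)
    {
        for (size_t k = 1; k <= 8; ++k) {
            size_t local  = k * preferred_multiple;
            size_t global = ((work_items + local - 1) / local) * local; /* pad */
            cl_event ev;

            clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                   &global, &local, 0, NULL, &ev);
            clWaitForEvents(1, &ev);

            cl_ulong start, end;
            clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                    sizeof(start), &start, NULL);
            clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                    sizeof(end), &end, NULL);
            printf("%zu\t%.3f ms\n", local, (double)(end - start) * 1e-6);
            clReleaseEvent(ev);
        }
    }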