
I have read the documentation and books (and also these posts: OpenCL: query number of processing elements ; Understanding work-items and work-groups ; OpenCL: Work items, Processing elements, NDRange) about the execution model and the theory of data partitioning with NDRange.

  1. Do I choose my work-items and work-groups based on my hardware? If yes, how can I query how many work-items and work-groups are available on a device? Is there a best practice for dividing work into work-items and work-groups to achieve good performance?

  2. I would like to know how they work and interact in practice, both for computation over a one-dimensional array and over a two-dimensional array such as an image.

Nico Mkhatvari

1 Answer

  1. Good partitioning requires knowledge of your GPU hardware. For example, consider an AMD card like the Radeon 6970. The overall number of cores is 1536, packed into 24 SIMD units. Each unit consists of 16 stream processors with a VLIW4 architecture, so we have 16 * 4 (because of VLIW4) * 24 = 1536 cores. All cores within a SIMD unit share some resources (caches, etc.). Hence, a good local work-group size on the Radeon 6970 is a multiple of 64. You can query your OpenCL device for its number of Compute Units; in this case you should get 24, so for OpenCL on the Radeon 6970, Compute Unit = SIMD unit. Please take into account that manual partitioning tuned for one device may cause performance drops on devices with a different architecture.

  2. A good example of the benefits of local work-groups can be found in the Nvidia developer zone. Take a look at the bitonic sort sample code, which shows how to use local groups.

talonmies
Roman Arzumanyan