I have read documentation and books (also these posts: OpenCL: query number of processing elements ; Understanding work-items and work-groups ; OpenCL: Work items, Processing elements, NDRange) about the execution model and and theory about data partitioning with NDrange.
Do I build my work-items and work-groups based on my hardware? If yes how can I query how many work-items and work-groups are available on a device? Is there a good practice how to divide work-items and work-groups to achieve a good performance?
I would like to know how do they work and interact in practice, for computation of one dimensional array and for two-dimensional array like an image.