I'm doing some image processing using OpenCL.
For example, I use a 100x200 image. In the .cl code, I simply halve each pixel value:
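{
    // image width and height in pixels, taken from the global work size
    int width = get_global_size(0);
    int height = get_global_size(1);
    // column index, in [0, width)
    int x = get_global_id(0);
    // row index, in [0, height)
    int y = get_global_id(1);
    // row-major index: the row stride is the image width
    data_output[y * width + x] = (unsigned char)(data_input[y * width + x] / 2);
}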
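For reference, the operation the kernel is meant to perform can be written in plain C as below. This is only a sketch for checking the kernel's output on the host: the function name halve_image_ref is made up, and it assumes an 8-bit, single-channel image stored row by row.

```c
#include <stddef.h>

/* Hypothetical CPU reference (not part of the OpenCL program):
   halve every pixel of a row-major, 8-bit, single-channel image. */
void halve_image_ref(const unsigned char *in, unsigned char *out,
                     size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            size_t i = y * width + x;  /* row-major index of pixel (x, y) */
            out[i] = (unsigned char)(in[i] / 2);
        }
    }
}
```

Comparing the buffer read back from the device against this output is a quick way to confirm that every local work size produces the same result before comparing timings.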
After setting the kernel arguments, I launch the kernel with:
clEnqueueNDRangeKernel(queue, kernel_DIP, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
The global_work_size I used is the image size:
size_t global_work_size[2] = {100,200};
I found that even though the .cl code never calls anything like get_local_id(0), local_work_size still has a large influence on performance.
Both a small local work size, "size_t local_work_size[2]= {1,1};", and a large one, "size_t local_work_size[2]= {50,50};", are slow.
A size in between, such as the one below, is much faster:
size_t local_work_size[2]= {10,10};
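To make the three settings concrete, here is a small sketch in plain C (no OpenCL calls) of how each local work size splits the {100,200} global range into work-groups. The helper names groups_along and items_per_group are made up for illustration, and the divisibility check reflects the OpenCL 1.x requirement that each local dimension evenly divide the matching global dimension.

```c
#include <stddef.h>

/* Hypothetical helper: number of work-groups along one dimension,
   or 0 if the local size does not evenly divide the global size. */
size_t groups_along(size_t global, size_t local)
{
    if (local == 0 || global % local != 0)
        return 0;
    return global / local;
}

/* Hypothetical helper: number of work-items in one 2D work-group. */
size_t items_per_group(const size_t local[2])
{
    return local[0] * local[1];
}
```

With a global size of {100,200}, a local size of {1,1} gives 20000 work-groups of 1 work-item each, {10,10} gives 200 work-groups of 100 work-items, and {50,50} gives 8 work-groups of 2500 work-items. A group of 2500 work-items may be larger than what a device reports for CL_DEVICE_MAX_WORK_GROUP_SIZE, so for that case it is worth checking the return code of clEnqueueNDRangeKernel.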
So here are my questions:
Why is a kernel that never calls get_local_id() still affected by the local work size?
How can I choose the local work size that gives the highest speed?
I also tested the running speed on other platforms, such as Freescale's i.MX6, and changing the local work size seems to make no difference there at all. Why?
If anyone knows the answer, please help. Thank you so much!