I'm new to OpenCL (using the OpenCL.NET library) in Visual Studio C#, and I'm working on an application that computes a large 3-D matrix. For each pixel in the matrix, 192 unique values are computed and then summed to give that pixel's final value. So, functionally, it is like a 4-D matrix of size (161 x 161 x 161) x 192.
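To pin down what is being computed, here is a minimal serial sketch in plain C of the per-pixel sum and of the total amount of work. The function `contribution` is a hypothetical placeholder for the long calculation (the real one is not shown in this post), and the names `pixel_value` and `total_work_items` are made up for illustration.

```c
#define N_EDGE  161   /* edge length of the cubic matrix, from the post */
#define N_TERMS 192   /* values summed per pixel, from the post */

/* Hypothetical stand-in for the long per-term calculation. */
static float contribution(int pixel_id, int term_id)
{
    return (float)((pixel_id + term_id) % 7);
}

/* Serial reference: sum n_terms contributions for one pixel. */
static float pixel_value(int pixel_id, int n_terms)
{
    float sum = 0.0f;
    for (int k = 0; k < n_terms; ++k)
        sum += contribution(pixel_id, k);
    return sum;
}

/* Number of (pixel, term) pairs, i.e. work-items if each pair gets one. */
static long long total_work_items(int n_edge, int n_terms)
{
    long long pixels = (long long)n_edge * n_edge * n_edge;
    return pixels * n_terms;
}
```

For the sizes in this post, `total_work_items(N_EDGE, N_TERMS)` evaluates to 801,269,952 (4,173,281 pixels times 192), which still fits in a signed 32-bit int.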
Right now I'm calling the kernel from my host code like this:
//C# host code
...
float[] BigMatrix = new float[161*161*161]; //1-D result array
CLCalc.Program.Variable dev_BigMatrix = new CLCalc.Program.Variable(BigMatrix);
CLCalc.Program.Variable dev_OtherArray = new CLCalc.Program.Variable(otherArray);
//...load some other variables here too.
CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[7] { /* stuff... */ };
//Here, I execute the kernel, with a 2-dimensional worker pool:
BigMatrixCalc.Execute(args, new int[2]{N*N*N,192});
dev_BigMatrix.ReadFromDeviceTo(BigMatrix);
Sample kernel code is posted below.
__kernel void MyKernel(
__global float * BigMatrix,
__global float * otherArray,
//various other variables...
)
{
int N = 161; //Size of matrix edges
int pixel_id = get_global_id(0); //The location of the pixel in the 1D array
int array_id = get_global_id(1); //The location within the otherArray
//Finding the x,y,z values of the pixel_id.
float3 p;
p.x = pixel_id % N;
p.y = ((pixel_id % (N*N))-p.x)/N;
p.z = (pixel_id - p.x - p.y*N)/(N*N);
float result;
//...
//Some long calculation for 'result' involving otherArray and p...
//...
BigMatrix[pixel_id] += result;
}
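The x, y, z recovery in the kernel can be checked on the host with pure integer arithmetic. Below is a minimal plain-C sketch, assuming x is the fastest-varying index as in the kernel. The type `Index3` and the helpers `unflatten` and `flatten` are hypothetical names and not part of OpenCL or OpenCL.NET.

```c
#include <assert.h>

/* Hypothetical helper type: integer coordinates of one pixel. */
typedef struct { int x, y, z; } Index3;

/* Split a flattened index into x, y, z (x varies fastest). */
static Index3 unflatten(int pixel_id, int n)
{
    assert(pixel_id >= 0 && pixel_id < n * n * n);
    Index3 p;
    p.x = pixel_id % n;          /* same as the kernel's p.x */
    p.y = (pixel_id / n) % n;    /* equals ((id % (n*n)) - x) / n */
    p.z = pixel_id / (n * n);    /* equals (id - x - y*n) / (n*n) */
    return p;
}

/* Inverse mapping, used to check the round trip. */
static int flatten(Index3 p, int n)
{
    return p.x + p.y * n + p.z * n * n;
}
```

Integer division keeps every intermediate value exact, so the result does not depend on float rounding the way the `float3` version in the kernel could for larger matrices.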
My code currently works; however, speed matters for this application, and I'm not sure whether my worker/group setup is the best approach (i.e. 161*161*161 and 192 as the dimensions of the worker pool).
I've seen other examples that organize the global worker pool into local worker groups to increase efficiency, but I'm not quite sure how to implement that in OpenCL.NET. I'm also not sure how this differs from just creating another dimension in the worker pool.
So, my question is: can I use local groups here, and if so, how would I organize them? In general, how is using local groups different from just calling an n-dimensional worker pool (i.e. calling Execute(args, new int[]{(N*N*N),192}), versus having a local workgroup size of 192)?
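To make the comparison concrete, here is a plain-C sketch of how the indices would relate in the local-group layout. It assumes a 1-D range of N*N*N*192 work-items with zero global offset and a local size of 192, where the OpenCL execution model gives get_global_id(0) = get_group_id(0) * get_local_size(0) + get_local_id(0). The type `WorkItemIds` and the function `split_global_id` are hypothetical names that only emulate that relation on the host.

```c
/* IDs one work-item would see in a 1-D range with zero global offset. */
typedef struct { int group_id, local_id; } WorkItemIds;

/* Emulates: global_id = group_id * local_size + local_id. */
static WorkItemIds split_global_id(long long global_id, int local_size)
{
    WorkItemIds ids;
    ids.group_id = (int)(global_id / local_size);  /* would play the role of pixel_id */
    ids.local_id = (int)(global_id % local_size);  /* would play the role of array_id */
    return ids;
}
```

Under these assumptions each work-group corresponds to one pixel and each local ID to one of the 192 terms, whereas in the 2-D pool get_global_id(0) and get_global_id(1) supply the same two numbers directly. The index values are the same either way. What a work-group adds is that its work-items can share `__local` memory and synchronize with `barrier()`.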
Thanks for all the help!