
I'm new to OpenCL (using the OpenCL.NET library) in Visual Studio C#, and am currently working on an application that computes a large 3D matrix. At each pixel in the matrix, 192 unique values are computed and then summed to yield the final value for that pixel. So, functionally, it is like a 4-D matrix: (161 x 161 x 161) x 192.

Right now I'm calling the kernel from my host code like this:

//C# host code
...
int N = 161; //Size of matrix edges
float[] BigMatrix = new float[N*N*N]; //1-D result array
CLCalc.Program.Variable dev_BigMatrix = new CLCalc.Program.Variable(BigMatrix);
CLCalc.Program.Variable dev_OtherArray = new CLCalc.Program.Variable(otherArray);
//...load some other variables here too.
CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[7] { /* ...the variables above... */ };

//Here, I execute the kernel, with a 2-dimensional worker pool:
BigMatrixCalc.Execute(args, new int[2]{N*N*N,192});
dev_BigMatrix.ReadFromDeviceTo(BigMatrix);

Sample kernel code is posted below.

__kernel void MyKernel(
__global float * BigMatrix,
__global float * otherArray
//...various other parameters...
)
{
    int N = 161; //Size of matrix edges
    int pixel_id = get_global_id(0); //The location of the pixel in the 1D array
    int array_id = get_global_id(1); //The location within the otherArray


    //Finding the x,y,z values of the pixel_id.
    float3 p;
    p.x = pixel_id % N;    
    p.y = ((pixel_id % (N*N))-p.x)/N;
    p.z = (pixel_id - p.x - p.y*N)/(N*N);

    float result;

    //...
    //Some long calculation for 'result' involving otherArray and p...
    //...

    BigMatrix[pixel_id] += result;
}

My code currently works, but I'm looking for more speed for this application, and I'm not sure my worker/group setup is the best approach (i.e. 161*161*161 by 192 as the dimensions of the worker pool).

I've seen other examples of organizing the global worker pool into local worker groups to increase efficiency, but I'm not quite sure how to implement that in OpenCL.NET. I'm also not sure how this differs from just creating another dimension in the worker pool.

So, my question is: can I use local groups here, and if so, how would I organize them? In general, how is using local groups different from just launching an n-dimensional worker pool? (i.e. calling Execute(args, new int[]{(N*N*N),192}), versus having a local workgroup size of 192?)
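
For reference, here's my (possibly mistaken) understanding of the distinction in the raw C API, where both sizes are explicit arguments to clEnqueueNDRangeKernel (the 'queue' and 'kernel' handles and all error checking are omitted):

//Illustration only: 'queue' and 'kernel' are assumed to be valid handles.
//What I do now, translated: a 2-D global range, the driver picks the local size.
size_t global2d[2] = {161*161*161, 192};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global2d, NULL, 0, NULL, NULL);

//Versus a 1-D range where I explicitly ask for 192 work items per group:
size_t global1d = (size_t)(161*161*161) * 192;
size_t local1d = 192;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global1d, &local1d, 0, NULL, NULL);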

Thanks for all the help!

superwillis
  • Are the values in BigMatrix computed against any other values in BigMatrix? How is 'p' used in the calculation? Can you give any more information about the computation you're trying to do? – mfa May 01 '12 at 11:12
  • Sure. The values of BigMatrix are not used in the calculation, only the indices. The values of BigMatrix are initially 0, and set to the result of the computation. The calculation uses the indices of the current pixel within BigMatrix (p.x,p.y,p.z) to find the vector to another point specified by a value in the otherArray. Therefore, each calculation is unique, as each pixel has a unique vector to each of the 192 points in otherArray. The magnitude and the distance of this vector are used in the final calculation for the final value in BigMatrix. – superwillis May 02 '12 at 01:26

2 Answers


I have a few suggestions for you:

  1. I think your code has a race condition. Your last line of code has the same element of BigMatrix being modified by multiple work items (every work item that shares a get_global_id(0)), with nothing synchronizing the +=.
  2. If your matrix is truly 161x161x161, there are plenty of work items here to use those dimensions as your only dimensions. You already have > 4 million work items, which should be plenty of parallelism for your machine. You don't need 192 times that. Plus, if you don't split the computation of an individual pixel into multiple work items, you won't need to synchronize the final add (see the sketch after this list).
  3. If your global work size is not a nice multiple of a big power of 2, you might try to pad it out so that it becomes one. Even if you pass NULL as your local work size, some OpenCL implementations choose inefficient local sizes for global sizes that don't divide well.
  4. If you don't need local memory or barriers for your algorithm, you can pretty much skip local workgroups.
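
Here's a minimal sketch of what suggestion #2 could look like (the inner loop body stands in for your original calculation, which I haven't seen):

__kernel void MyKernel(
__global float * BigMatrix,
__global float * otherArray
//...various other parameters...
)
{
    int N = 161;
    int pixel_id = get_global_id(0); //launched with a 1-D global size of N*N*N

    float3 p;
    p.x = pixel_id % N;
    p.y = ((pixel_id % (N*N)) - p.x)/N;
    p.z = (pixel_id - p.x - p.y*N)/(N*N);

    float sum = 0.0f;
    for(int i = 0; i < 192; i++){
        //...the same long calculation as before, using otherArray[i] and p...
        //sum += ...;
    }
    BigMatrix[pixel_id] = sum; //this work item is the only writer: no race
}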

Hope this helps!

boiler96
  • Thanks for the reply. I like the idea of using atomic_add, however it seems to be only for type int. My calculation has to be a floating point calculation, so I need to be able to do a synchronized add involving floats. Is there any alternative to atomic_add that can add floats? – superwillis May 02 '12 at 01:30
  • Ugh. Good catch. No, there's no support for floating point atomics in OpenCL. Given that, I would really consider just launching 161x161x161 work items. – boiler96 May 02 '12 at 06:16
  • 1
    #2 I agree. unrolling the 192 loop is a bit overkill. #3 Alternately, compute the largest round global work size you can, and farm out the remaining work to a CPU kernel. #4 I disagree on this point. I will post my solution; it relies on locals to speed things up greatly. – mfa May 02 '12 at 13:25
  • I don't dispute that using locals could make this faster. My statement #4 says if you don't use locals or barriers, don't fret too much over workgroups. – boiler96 May 02 '12 at 14:50
  • Right. I guess it's not a requirement. – mfa May 02 '12 at 15:24

I think a lot of performance is lost waiting on memory access. I have answered a similar SO question. I hope my post helps you out. Please ask any questions you have.

Optimizations:

  1. The big boost in my version of your kernel comes from reading otherArray into local memory.
  2. Each work item computes 4 values in BigMatrix. This means they can be written at the same time, on the same cache line. There is minimal loss of parallelism because there are still > 1M work items to execute.

...

#define N 161
#define Nsqr (N*N)
#define Ncub (N*N*N)
#define otherSize 192

__kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
{
    //using 1 quarter of the total size of the matrix
    //this work item will be responsible for computing 4 consecutive values in BigMatrix
    //also reduces global size to (N^3)/4  ~= 1043000 for N=161

    int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
    int pixel_id;
    //array_id won't be used anymore. work items will process BigMatrix[pixel_id] entirely

    int local_id = get_local_id(0); //work item id within the group
    int local_size = get_local_size(0); //size of group


    float result[4]; //result cached for 4 global values
    int i, j;
    float3 p;

    //cache the values in otherArray to local memory
    //now each work item in the group will be able to read the values efficiently
    //each element in otherArray will be read a total of N^3 times, so this is important
    //OpenCL guarantees at least 16 KB of local memory, so up to 4k floats will fit fine
    __local float otherValues[otherSize];
    for(i=local_id; i<otherSize; i+= local_size){
        otherValues[i] = otherArray[i];
    }
    barrier(CLK_LOCAL_MEM_FENCE); //wait for the whole group to finish copying before reading

    //now this work item can compute the complete result for pixel_id 
    for(j=0;j<4;j++){
        result[j] = 0;
        pixel_id = global_id + j;

        //Finding the x,y,z values of the pixel_id.
        //TODO: optimize the calculation of p.y and p.z
        //they will be the same most of the time for a given work item
        p.x = pixel_id % N;    
        p.y = ((pixel_id % Nsqr)-p.x)/N;
        p.z = (pixel_id - p.x - p.y*N)/Nsqr;

        for(i=0;i<otherSize;i++){
            //...
            //Some long calculation for 'result' involving otherValues[i] and p...
            //...
            //result[j] += ...
        }
    }
    //4 consecutive writes to BigMatrix will fall in the same cacheline (faster)
    BigMatrix[global_id] += result[0];
    BigMatrix[global_id + 1] += result[1];
    BigMatrix[global_id + 2] += result[2];
    BigMatrix[global_id + 3] += result[3];
}

Notes:

  1. The global work size needs to be a multiple of four (ideally, a multiple of 4*workgroupsize), because there is no error checking to see that each pixel_id falls within the range 0..N^3-1. Unprocessed elements can be crunched by the CPU while you wait for the kernel to execute (see the host-side sketch after the second kernel).
  2. The work group size should be fairly large. This will force the cached values to be used more heavily, and the benefit of caching the data in LDS will grow.
  3. There is a further optimization to be done with the calculation of p.x/y/z in order to avoid too many costly division and modulo operations. See the code below.

    __kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
    {
        int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
        int pixel_id = global_id;

        int local_id = get_local_id(0); //work item id within the group
        int local_size = get_local_size(0); //size of group

        float result[4]; //result cached for 4 global values
        int i, j;
        float3 p;
        //Finding the initial x,y,z values of the pixel_id.
        p.x = pixel_id % N;
        p.y = ((pixel_id % Nsqr)-p.x)/N;
        p.z = (pixel_id - p.x - p.y*N)/Nsqr;

        //cache the values in otherValues here, same as above...

        //now this work item can compute the complete result for pixel_id
        for(j=0;j<4;j++){
            result[j] = 0;

            for(i=0;i<otherSize;i++){
                //same i loop as above...
            }

            //increment the x,y,and z values instead of computing them all from scratch
            //(done after each computation so the first pass uses global_id's own coordinates)
            p.x += 1;
            if(p.x >= N){
                p.x = 0;
                p.y += 1;
                if(p.y >= N){
                    p.y = 0;
                    p.z += 1;
                }
            }
        }

        //same four writes to BigMatrix as above...
    }
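
To make note 1 concrete, here is a rough host-side sketch in the plain C API (the 'queue' and 'kernel' handles and all buffer setup are assumed to already exist; the numbers follow my kernel above):

size_t local_size = 64;                         //workgroup size (tunable)
size_t total = 161*161*161;                     //N^3 pixels
size_t items = (total/4/local_size)*local_size; //whole workgroups only
size_t covered = items*4;                       //pixels handled by the GPU

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &items, &local_size,
                       0, NULL, NULL);

//meanwhile, crunch the leftover pixels (covered..total-1) on the host
for(size_t pixel_id = covered; pixel_id < total; pixel_id++){
    //...the same per-pixel calculation as the kernel...
}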
    
mfa
  • Thanks for the great reply! I do have questions however, as I can't seem to get your setup working with my code: 1) Looking at your code, each worker thread will create a new cached "otherValues" matrix, but I don't understand why the size of the cached array is still 192...aren't you only filling in (192/local_size) elements? I think the rest of the elements would be null, right? 2) Similarly, why do you loop through all 192 elements in the final for-loop if only some values are available? I guess I'm confused as to what the cache is really accomplishing in terms of local vs global workers. – superwillis May 03 '12 at 05:44
  • The local array of 192 floats is created once and shared by the entire work group. The for loop which copies the data starts at 'local_id', which will be different for each work item in the group. Then it loops with i+=local_size to cover the case where there are fewer than 192 work items in a group. So if you have a work group size of 192, each work item will copy exactly one element into otherValues. The barrier line makes the group wait until all of the values are copied before entering the computation loop over all 192 elements. – mfa May 03 '12 at 12:26
  • otherValues is not to be confused with result[4]. otherValues is shared between all work items in the group. result is a private array created by each work item to store its four results at the same time, with the sole purpose of delaying the global write operation until 4 consecutive floats are ready to write. – mfa May 03 '12 at 12:29