
I am new to OpenCL and I am trying to write kernel code for the following matrix operation:

A is a 2x2 matrix:
A = [1  2] ----> row1
    [3  4] ----> row2

I need to compute:
1) s1 = transpose(row1) X row1
2) s2 = transpose(row2) X row2
3) Sum = s1 + s2


I wrote kernel code at the row level (i.e. I can compute transpose(row1) X row1), but this serves the purpose for the first row only.

How do I use parallelism to compute this for each row and find the final sum within the kernel function?
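For reference, the three steps above can be checked in plain Python (no OpenCL involved): each s-term is a 2x2 outer product, and their sum equals transpose(A) X A. The helper name `outer` is just for illustration.

```python
def outer(v):
    """Outer product transpose(v) x v for a row vector v."""
    return [[a * b for b in v] for a in v]

A = [[1, 2], [3, 4]]
s1 = outer(A[0])          # [[1, 2], [2, 4]]
s2 = outer(A[1])          # [[9, 12], [12, 16]]
Sum = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
print(Sum)                # [[10, 14], [14, 20]]
```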

private static String programSource1 =
            "__kernel"+
            " void matrixMul(__global float* A, __global float* C,  int rowLength)"+
            "{"+
                "int row = get_global_id(1);"+
                "int col = get_global_id(0);"+              
                    "C[row*rowLength+col] = A[col] * A[row];"+

            "}";
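To see why this kernel only covers the first row, here is a hypothetical host-side emulation of it in plain Python: each work-item (row, col) writes `C[row*rowLength+col] = A[col] * A[row]`, so it only ever reads `A[0..rowLength-1]`, i.e. it computes transpose(row1) X row1 regardless of how many rows A has.

```python
def emulate_matrixMul(A, rowLength):
    """Emulate the kernel: one iteration per work-item (row, col)."""
    C = [0.0] * (rowLength * rowLength)
    for row in range(rowLength):          # get_global_id(1)
        for col in range(rowLength):      # get_global_id(0)
            C[row * rowLength + col] = A[col] * A[row]
    return C

A = [1.0, 2.0, 3.0, 4.0]          # the 2x2 matrix, row-major
print(emulate_matrixMul(A, 2))    # [1.0, 2.0, 2.0, 4.0] = transpose(row1) x row1
```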
b4hand
  • Is the final implementation going to work only on 2x2 array of matrices or bigger matrices? This changes completely how to implement the code.... – DarkZeros Oct 14 '13 at 15:29

1 Answer

#define MAX_ROW_LENGTH 2 // or more 
__kernel void matrixMul(__global float* A, __global float* C,  
                                                 int rowLength)
{
     __local float buffer[MAX_ROW_LENGTH  * MAX_ROW_LENGTH];
     __local float s1[MAX_ROW_LENGTH * MAX_ROW_LENGTH];

    int col = get_global_id(0);
    int row = get_global_id(1);
    int rows = get_global_size(1);

    // read the matrix from global to local memory
    buffer[row * rowLength + col] = A[row * rowLength + col]; 
    s1[row * rowLength + col] = 0.0f;

    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 0; i < rows; ++i)
    {
        s1[row * rowLength + col] += 
                buffer[i * rowLength + col] * buffer[i * rowLength + row];
    }
    C[row * rowLength + col] = s1[row*rowLength+col];
}

Here is some kernel code that does what you want for small matrices. The kernel uses local memory to reduce global memory accesses. For such a small problem (a 2x2 matrix) this won't achieve anything, but if you are computing larger matrices it can speed things up a little. However, this is a short example and not optimized, and it comes with some limitations:

  • this code only supports local workgroup sizes equal to the global workgroup size (no chunks),
  • if your matrices get too big, the shared memory will limit the utilization of your GPU, and
  • if your matrices get really big, there will not be enough shared memory.

If you don't want local memory, replace the accesses to buffer within the for loop by accesses to A, drop s1, and write directly to C.
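In scalar terms, the loop in the answer's kernel computes C[row*rowLength+col] = sum over i of A[i*rowLength+col] * A[i*rowLength+row], which is exactly transpose(A) X A, i.e. the sum of the per-row outer products. A plain-Python emulation of the per-work-item logic (helper name hypothetical) confirms this for the 2x2 example:

```python
def emulate_kernel(A, rowLength):
    """Emulate the answer's kernel: one iteration per work-item (row, col)."""
    rows = rowLength  # get_global_size(1)
    C = [0.0] * (rowLength * rowLength)
    for row in range(rows):
        for col in range(rowLength):
            s = 0.0
            for i in range(rows):
                s += A[i * rowLength + col] * A[i * rowLength + row]
            C[row * rowLength + col] = s
    return C

A = [1.0, 2.0, 3.0, 4.0]        # the 2x2 matrix, row-major
print(emulate_kernel(A, 2))     # [10.0, 14.0, 14.0, 20.0]
```

This matches s1 + s2 = [[1, 2], [2, 4]] + [[9, 12], [12, 16]] = [[10, 14], [14, 20]].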

Michael Haidl