Below is an OpenCL kernel that performs blocked matrix multiplication for multiple independent matrices. selectMatrixA and selectMatrixB each store multiple square matrices of the same size, in row-major order.
// Matrix multiplication: C = A * B, batched over multiple independent matrices.
#define BLOCK_SIZE 20
#define MATRIX_SIZE (100 * 100)  // elements per matrix; parenthesized so the macro is safe in expressions
#define BLOCK_DIMX 5             // number of blocks in the x dimension of one matrix (100 / BLOCK_SIZE)

__kernel void batchedMatrixMul(__global float *selectMatrixC,
                               __global float *selectMatrixA,
                               __global float *selectMatrixB,
                               int wA, int wB)
{
    // Block index: bx runs across all matrices, so bx / BLOCK_DIMX selects
    // the matrix and bx % BLOCK_DIMX selects the block column within it.
    int bx = get_group_id(0);
    int by = get_group_id(1);

    __global float *C = selectMatrixC + (bx / BLOCK_DIMX) * MATRIX_SIZE;
    __global float *A = selectMatrixA + (bx / BLOCK_DIMX) * MATRIX_SIZE;
    __global float *B = selectMatrixB + (bx / BLOCK_DIMX) * MATRIX_SIZE;

    // Thread index within the block
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    float Csub = 0.0f;

    // Identify the row and column of the C matrix to work on
    int Row = (by * BLOCK_SIZE) + ty;
    int Col = ((bx % BLOCK_DIMX) * BLOCK_SIZE) + tx;

    // Local memory arrays used to stage the sub-matrices of A and B
    __local float As[BLOCK_SIZE][BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Loop over all the sub-matrices of A and B required to compute the block sub-matrix
    for (int m = 0; m < wA / BLOCK_SIZE; ++m)
    {
        // Load the sub-matrices from global to local memory;
        // each thread loads one element of each matrix.
        As[ty][tx] = A[Row * wA + m * BLOCK_SIZE + tx];
        Bs[ty][tx] = B[(m * BLOCK_SIZE + ty) * wB + Col];  // row stride of B is wB (equal to wA here, since the matrices are square)

        // Synchronize to make sure the sub-matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two sub-matrices; each thread computes one element
        // of the block sub-matrix.
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure the computation is done before loading
        // two new sub-matrices in the next iteration.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the block sub-matrix to device memory; each thread writes one element.
    C[Row * wB + Col] = Csub;
}
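To make the data layout concrete: each matrix occupies a contiguous MATRIX_SIZE-element, row-major slab in the batch buffer, and the kernel splits the x work-group id into a matrix index and a block column. A minimal host-side sketch of that indexing (the helper names batchIndex and decodeGroupX are only for illustration, they are not in my real code):

#include <stddef.h>

#define MATRIX_DIM  100
#define MATRIX_SIZE (MATRIX_DIM * MATRIX_DIM)
#define BLOCK_SIZE  20
#define BLOCK_DIMX  (MATRIX_DIM / BLOCK_SIZE)  /* 5 block columns per matrix */

/* Flat index of element (r, c) of matrix i in the packed batch buffer. */
size_t batchIndex(int i, int r, int c)
{
    return (size_t)i * MATRIX_SIZE + (size_t)r * MATRIX_DIM + (size_t)c;
}

/* The kernel's decomposition of get_group_id(0). */
void decodeGroupX(int bx, int *matrix, int *blockCol)
{
    *matrix   = bx / BLOCK_DIMX;   /* which matrix in the batch   */
    *blockCol = bx % BLOCK_DIMX;   /* block column in that matrix */
}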
Here is how I launch the kernel:
localWorkSize[0] = BLOCK_SIZE;
localWorkSize[1] = BLOCK_SIZE;
// For a 100 x 100 matrix, MATRIX_DIMX = MATRIX_DIMY = 100
globalWorkSize[0] = MATRIX_DIMX * NUM_MATRICES;
globalWorkSize[1] = MATRIX_DIMY;
cl_event event;
errcode = clEnqueueNDRangeKernel(clCommandQueue, clKernel, 2, NULL,
                                 globalWorkSize, localWorkSize, 0, NULL, &event);
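One way to time just the kernel is via OpenCL event profiling; a minimal sketch (it assumes clCommandQueue was created with CL_QUEUE_PROFILING_ENABLE):

cl_ulong start = 0, end = 0;
clWaitForEvents(1, &event);  /* make sure the kernel has finished */
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);
double seconds = (end - start) * 1e-9;  /* profiling timestamps are in nanoseconds */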
Below are some performance numbers from running this on an NVIDIA GRID K520:
1. Matrix size: 100 x 100, number of matrices = 20000. Time taken for the multiplication = 0.262 seconds. As shown in the code, the block size was set to 20 (a block size of 10 was slower). This works out to around 152 GFLOPS.
2. Matrix size: 10000 x 10000, number of matrices = 1. Here too the block size was 20; time taken for the multiplication = 10.6 seconds. A block size of 50 is not possible because of the size of local memory.
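(For case 1, the GFLOPS figure follows from counting 2 * N^3 floating-point operations per N x N matrix multiply: 2 * 100^3 * 20000 = 4.0e10 operations, and 4.0e10 / 0.262 s ≈ 152.7 GFLOPS.)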
Can someone please help me understand why this code runs slowly, and why case 2 is so much slower than case 1? I am new to OpenCL, and I want to learn how to optimize code based on the underlying architectural details.