Below is an OpenCL kernel that performs blocked matrix multiplication for multiple independent matrices. selectMatrixA and selectMatrixB each store multiple square matrices of the same size, in row-major order.
// Matrix multiplication: C = A * B, batched over multiple independent matrices.
#define BLOCK_SIZE 20
#define MATRIX_SIZE (100 * 100)  // elements per matrix; parenthesized so the macro is safe in expressions
#define BLOCK_DIMX 5             // number of blocks in the x dimension of one matrix (100 / BLOCK_SIZE)

__kernel void batchedMatrixMul(__global float *selectMatrixC,
                               __global float *selectMatrixA,
                               __global float *selectMatrixB,
                               int wA, int wB)
{
    // Block index: bx runs across all matrices, so bx / BLOCK_DIMX selects
    // the matrix and bx % BLOCK_DIMX selects the block column within it.
    int bx = get_group_id(0);
    int by = get_group_id(1);

    __global float *C = selectMatrixC + (bx / BLOCK_DIMX) * MATRIX_SIZE;
    __global float *A = selectMatrixA + (bx / BLOCK_DIMX) * MATRIX_SIZE;
    __global float *B = selectMatrixB + (bx / BLOCK_DIMX) * MATRIX_SIZE;

    // Thread index within the block
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    float Csub = 0.0f;

    // Identify the row and column of the C matrix to work on
    int Row = (by * BLOCK_SIZE) + ty;
    int Col = ((bx % BLOCK_DIMX) * BLOCK_SIZE) + tx;

    // Local memory arrays used to stage the sub-matrices of A and B
    __local float As[BLOCK_SIZE][BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Loop over all the sub-matrices of A and B required to compute the block sub-matrix
    for (int m = 0; m < wA / BLOCK_SIZE; ++m)
    {
        // Load the sub-matrices from global to local memory;
        // each thread loads one element of each matrix.
        As[ty][tx] = A[Row * wA + m * BLOCK_SIZE + tx];
        Bs[ty][tx] = B[(m * BLOCK_SIZE + ty) * wB + Col];  // row stride of B is wB (equal to wA here, since the matrices are square)

        // Synchronize to make sure the sub-matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two sub-matrices; each thread computes one element
        // of the block sub-matrix.
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure the computation is done before loading
        // two new sub-matrices in the next iteration.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the block sub-matrix to device memory; each thread writes one element.
    C[Row * wB + Col] = Csub;
}
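To make the data layout concrete: each matrix occupies a contiguous MATRIX_SIZE-element, row-major slab in the batch buffer, and the kernel splits the x work-group id into a matrix index and a block column. A minimal host-side sketch of that indexing (the helper names batchIndex and decodeGroupX are only for illustration, they are not in my real code):

#include <stddef.h>

#define MATRIX_DIM  100
#define MATRIX_SIZE (MATRIX_DIM * MATRIX_DIM)
#define BLOCK_SIZE  20
#define BLOCK_DIMX  (MATRIX_DIM / BLOCK_SIZE)  /* 5 block columns per matrix */

/* Flat index of element (r, c) of matrix i in the packed batch buffer. */
size_t batchIndex(int i, int r, int c)
{
    return (size_t)i * MATRIX_SIZE + (size_t)r * MATRIX_DIM + (size_t)c;
}

/* The kernel's decomposition of get_group_id(0). */
void decodeGroupX(int bx, int *matrix, int *blockCol)
{
    *matrix   = bx / BLOCK_DIMX;   /* which matrix in the batch   */
    *blockCol = bx % BLOCK_DIMX;   /* block column in that matrix */
}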
Here is how I launch the kernel:
localWorkSize[0] = BLOCK_SIZE;
localWorkSize[1] = BLOCK_SIZE;
// For a 100 x 100 matrix, MATRIX_DIMX = MATRIX_DIMY = 100
globalWorkSize[0] = MATRIX_DIMX * NUM_MATRICES;
globalWorkSize[1] = MATRIX_DIMY;
cl_event event;
errcode = clEnqueueNDRangeKernel(clCommandQueue, clKernel, 2, NULL,
                                 globalWorkSize, localWorkSize, 0, NULL, &event);
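One way to time just the kernel is via OpenCL event profiling; a minimal sketch (it assumes clCommandQueue was created with CL_QUEUE_PROFILING_ENABLE):

cl_ulong start = 0, end = 0;
clWaitForEvents(1, &event);  /* make sure the kernel has finished */
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);
double seconds = (end - start) * 1e-9;  /* profiling timestamps are in nanoseconds */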
Below are some performance numbers from running this on an NVIDIA GRID K520:
1. Matrix size: 100 x 100, number of matrices = 20000. Time taken for the multiplication = 0.262 seconds. As shown in the code, the block size was set to 20 (a block size of 10 was slower). This works out to around 152 GFLOPS.
2. Matrix size: 10000 x 10000, number of matrices = 1. Here too the block size was 20; time taken for the multiplication = 10.6 seconds. A block size of 50 is not possible because of the size of local memory.
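(For case 1, the GFLOPS figure follows from counting 2 * N^3 floating-point operations per N x N matrix multiply: 2 * 100^3 * 20000 = 4.0e10 operations, and 4.0e10 / 0.262 s ≈ 152.7 GFLOPS.)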
Can someone please help me understand why this code runs slowly, and why case 2 is so much slower than case 1? I am new to OpenCL, and I want to learn how to optimize code based on the underlying architectural details.