For my GPU programming class, we've been tasked with completing certain parts of a non-square matrix multiplication program: specifically, the kernel function and the initialization of the thread block and kernel grid dimensions.

I've based my code on the matrix multiplication example in the CUDA C Programming Guide, but instead of using structs as it does, I've modified mine to use only the parameters given (since we're not allowed to change the parameters). We are provided with the three matrices A, B, and C, as well as their dimensions: m x k, k x n, and m x n, respectively. Where the struct code used A.height I've used the dimension m, where it used B.width I've used n, and so on.

I've run into several problems, the first of which is that my program doesn't pass the included test, which verifies the correctness of the product matrix C. I assume, then, that something is wrong in my matrix multiplication code, and that the issue probably arises from my adaptation of the struct-based code.

#include <stdio.h>
__global__ void mysgemm(int m, int n, int k, const float *A, const float *B,
        float* C) {

    /********************************************************************
     *
     * Compute C = A x B
     *   where A is a (m x k) matrix
     *   where B is a (k x n) matrix
     *   where C is a (m x n) matrix
     *
     ********************************************************************/

    // INSERT KERNEL CODE HERE
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    for (int e = 0; e < k; ++e){
        Cvalue += (A[row * k + e]) * (B[e * n + col]);
    }
    C[row * n + col] = Cvalue;
}

My other problem, which I'm even less sure about, involves the code to initialize the thread block and kernel grid dimensions.

// Initialize thread block and kernel grid dimensions ---------------------
    const unsigned int BLOCK_SIZE = 16; // Use 16x16 thread blocks
//INSERT CODE HERE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(n / dimBlock.x, m / dimBlock.y);
// Invoke CUDA kernel -----------------------------------------------------
//INSERT CODE HERE
    mysgemm<<<dimGrid, dimBlock>>>(m, n, k, A, B, C);

I understand dimBlock, but I don't understand dimGrid, and don't have a proper idea of what to use as parameters for it. When I run the code as is, the kernel won't even launch if the matrix I pass in doesn't have a dimension that is a power of 2. And if I do use a power of 2, the test still fails.

I apologize if I've been too wordy. This is my first post and I wanted to give as many details as possible. Hopefully someone can help walk me through these issues.

JoBo
    There are plenty of questions about CUDA matrix multiplication, with nearly every possible variant considered, like [this one](http://stackoverflow.com/questions/18815489/cuda-tiled-matrix-matrix-multiplication-with-shared-memory-and-matrix-size-whic) for example. Perhaps you should review some of the questions that have already been asked for ideas/hints/clues. – Robert Crovella Sep 25 '13 at 06:43

2 Answers

The kernel below is a variant of the one I posted in

CUDA: Tiled matrix-matrix multiplication with shared memory and matrix size which is non-multiple of the block size

except that it does not use shared memory.

__global__ void MatMulNoShared(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {

    float CValue = 0;

    // Global row and column of the C element this thread computes
    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;

    // Ceiling division: the last iteration covers a trailing partial tile
    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {

        // Guard against reading past the edges of A and B
        for (int n = 0; n < TILE_DIM; ++n)
            if ((k*TILE_DIM + n < ACols && Row < ARows) && (k*TILE_DIM + n < BRows && Col < BCols))
                CValue += A[Row*ACols + k*TILE_DIM + n] * B[(k*TILE_DIM + n)*BCols + Col];

    }

    // Guard against writing past the edge of C
    if (Row < CRows && Col < CCols) C[Row*CCols + Col] = CValue;
}

The two if statements in the kernel are the ones mentioned in Eric's answer.

For your convenience, here is the full code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define TILE_DIM 16   // Tile dimension
#define DIMX 373      // A is DIMX x DIMY, B is DIMY x DIMZ, C is DIMX x DIMZ
#define DIMY 242
#define DIMZ 533

__global__ void MatMulNoShared(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {

    float CValue = 0;

    // Global row and column of the C element this thread computes
    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;

    // Ceiling division: the last iteration covers a trailing partial tile
    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {

        // Guard against reading past the edges of A and B
        for (int n = 0; n < TILE_DIM; ++n)
            if ((k*TILE_DIM + n < ACols && Row < ARows) && (k*TILE_DIM + n < BRows && Col < BCols))
                CValue += A[Row*ACols + k*TILE_DIM + n] * B[(k*TILE_DIM + n)*BCols + Col];

    }

    // Guard against writing past the edge of C
    if (Row < CRows && Col < CCols) C[Row*CCols + Col] = CValue;
}

int main() {

    int CCols = DIMZ, CRows=DIMX, ACols=DIMY, ARows=DIMX, BCols=DIMZ, BRows=DIMY;

    dim3 dimBlock(TILE_DIM, TILE_DIM, 1);
    dim3 dimGrid;

    dimGrid.x = (CCols + dimBlock.x - 1)/dimBlock.x;
    dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;

    float *deviceA, *deviceB, *deviceC;

    float* hostA    = (float*)malloc(DIMX*DIMY*sizeof(float));
    float* hostB    = (float*)malloc(DIMY*DIMZ*sizeof(float));
    float* hostC    = (float*)malloc(DIMX*DIMZ*sizeof(float));
    float* hostCp   = (float*)malloc(DIMX*DIMZ*sizeof(float));

    // A is DIMX x DIMY, B is DIMY x DIMZ
    for (int x = 0; x<DIMX; x++)
        for (int y = 0; y<DIMY; y++)
            hostA[x*DIMY+y] = rand()/(float)RAND_MAX;

    for (int x = 0; x<DIMY; x++)
        for (int y = 0; y<DIMZ; y++)
            hostB[x*DIMZ+y] = rand()/(float)RAND_MAX;

    cudaMalloc((void **)&deviceA, DIMX*DIMY*sizeof(float));
    cudaMalloc((void **)&deviceB, DIMY*DIMZ*sizeof(float));
    cudaMalloc((void **)&deviceC, DIMX*DIMZ*sizeof(float));

    cudaMemcpy(deviceA, hostA, DIMX*DIMY*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, DIMY*DIMZ*sizeof(float), cudaMemcpyHostToDevice);

    MatMulNoShared<<<dimGrid , dimBlock>>>(deviceA , deviceB , deviceC , ARows , ACols, BRows ,BCols , CRows , CCols);

    cudaMemcpy(hostC, deviceC, DIMX*DIMZ*sizeof(float), cudaMemcpyDeviceToHost);

    // Release device and host buffers
    cudaFree(deviceA); cudaFree(deviceB); cudaFree(deviceC);
    free(hostA); free(hostB); free(hostC); free(hostCp);

    return 0;
}
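
Since the question mentions a test that verifies the product, a minimal host-side check could be placed right after the final cudaMemcpy (a sketch: it assumes the otherwise-unused hostCp buffer holds the CPU reference result):

    // Compute the reference product C = A x B on the host
    for (int i = 0; i < DIMX; i++)
        for (int j = 0; j < DIMZ; j++) {
            float sum = 0;
            for (int e = 0; e < DIMY; e++)
                sum += hostA[i*DIMY + e] * hostB[e*DIMZ + j];
            hostCp[i*DIMZ + j] = sum;
        }

    // Compare against the GPU result with a loose relative tolerance,
    // since float accumulation order differs between host and device
    int mismatches = 0;
    for (int i = 0; i < DIMX*DIMZ; i++)
        if (fabsf(hostC[i] - hostCp[i]) > 1e-3f * fabsf(hostCp[i]))
            mismatches++;
    printf("%d mismatches\n", mismatches);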

Note that the two instructions

    dimGrid.x = (CCols + dimBlock.x - 1)/dimBlock.x;
    dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;

ensure full tiled coverage of matrix C, as mentioned at point 1 of Eric's answer. For example, with CCols = 533 and TILE_DIM = 16, dimGrid.x = (533 + 15)/16 = 34 blocks spanning 544 threads in x; the final if statement in the kernel masks off the 11 threads that fall beyond the edge of C.

Vitality

Your code currently works only when m and n are multiples of 16, which is your block size. For example, with n = 100 and BLOCK_SIZE = 16, n / dimBlock.x truncates to 6 blocks, i.e. 96 threads across, so the last 4 columns of C are never computed.

Two things you can do to make it work for arbitrary sizes:

  1. Make the grid size large enough to cover the whole matrix C. Instead of taking the floor of n/blockdim.x as you have done, take the ceiling of that value:

     (n+blockdim.x-1)/blockdim.x
    
  2. After you have done step 1, the grid will be slightly larger than the result matrix C because of the ceiling operation. You can then limit the computation to the exact size of C by adding an if clause in the kernel (see the sketch after this list).
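
Applied to the kernel and launch code in the question, the two fixes might look like the following sketch (it keeps the parameter list required by the assignment; BLOCK_SIZE is the 16 defined in the question):

    __global__ void mysgemm(int m, int n, int k, const float *A, const float *B,
            float *C) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        // Step 2: threads that fall outside the m x n result do nothing
        if (row < m && col < n) {
            float Cvalue = 0;
            for (int e = 0; e < k; ++e)
                Cvalue += A[row * k + e] * B[e * n + col];
            C[row * n + col] = Cvalue;
        }
    }

    // Step 1: round the grid size up so every element of C is covered
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x,
                 (m + dimBlock.y - 1) / dimBlock.y);
    mysgemm<<<dimGrid, dimBlock>>>(m, n, k, A, B, C);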

Please refer to the CUDA docs for more details, especially the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

kangshiyin