how to generalize square matrix multiplication to handle arbitrary dimensions

Question

I have written this program and I am having some trouble understanding how to use multiple blocks by using dim3 variable in the kernel call line. This code works fine when I am doing 1000*1000 matrix multiplication, but not getting correct answer for lower dimensions like 100*100 , 200*200.

#include <stdio.h>
#include <cuda.h>
#define width 1000

__global__ void kernel(int *a,int *b,int *c)
{

        int tx = threadIdx.x + blockIdx.x*blockDim.x;
        int ty = threadIdx.y + blockIdx.y*blockDim.y;

        int sum=0,k;

        for(k=0;k<(width);++k)
        {
                sum += a[ty*width +k]*b[k*width + tx];
        }
        c[ty*width + tx] = sum;
}


int main()
{
        int a[width*width],c[width*width],b[width*width];
        int *dev_a,*dev_b,*dev_c;
        int i,count=0;
        int size = (width*width)*sizeof(int);

         for(i=0;i<(width*width);i++)
        {
                a[i] = 1;
                b[i] = 1;
        }

        cudaMalloc((void **)&dev_a,size);
        cudaMalloc((void **)&dev_b,size);
        cudaMalloc((void **)&dev_c,size);

        cudaMemcpy(dev_a,&a,size,cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b,&b,size,cudaMemcpyHostToDevice);

        dim3 dimBlock(20,20);
        dim3 blockID(50,50);

        kernel<<<blockID,dimBlock>>>(dev_a,dev_b,dev_c);

        cudaMemcpy(&c,dev_c,size,cudaMemcpyDeviceToHost);

        for(i=0;i<(width*width);i++)
        {
                count++;
                if(count == (width+1))
                {
                        count = 1;
                        printf("\n");
                }

                printf("%d ",c[i]);
        }
        printf("\n");
        return 0;
}

Yeah the answer Robert Crovella gave that's what I was looking for @talonmies and it was my first question on stackoverflow. — mayview, Jun 19 '15 at 15:26

score 1 · Accepted Answer · edited May 23 '17 at 12:06

This code will work for very specific dimensions but not for others.

It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads - 20 in the code you have shown) and your grid dimension (number of blocks - 50 in the code you have shown).

So when width is 20*50 (1000) it will work as shown. But if I change width to some other value (say 800) and make no other changes, your code won't work. In the case of 800, however, I could get your code working by changing the grid dimension from 50 to 40, then width = 800 = 20 *40.

But what if I need to multiply two matrices of width 799? I can't come up with a product of grid and block dimension that will match that width exactly.

This is a fairly standard problem in CUDA programming - I cannot come up with convenient block and grid dimensions to exactly match my work (i.e. data) size, and if I launch too many (threads/blocks) things don't seem to work.

To fix this problem we must do 2 things:

Be sure to launch at least enough, but maybe more than enough threads (blocks of threads) to cover the entire data set
Add conditional code in the kernel, so that only the threads corresponding to valid data do any real work.

To address item 1 above, we modify our grid dimension calculations to something like this:

    dim3 dimBlock(16,16);
    dim3 blockID((width+dimBlock.x-1)/dimBlock.x,(width+dimBlock.y-1)/dimBlock.y);

To address item 2 above we modify our kernel code to condition thread behavior on whether or not the thread corresponds to valid data:

__global__ void kernel(int *a,int *b,int *c, int mwidth)
{

        int tx = threadIdx.x + blockIdx.x*blockDim.x;
        int ty = threadIdx.y + blockIdx.y*blockDim.y;
        if ((tx<mwidth)&&(ty<mwidth)){

          int sum=0,k;

          for(k=0;k<(mwidth);++k)
          {
                sum += a[ty*mwidth +k]*b[k*mwidth + tx];
          }
          c[ty*mwidth + tx] = sum;}
}

And since we've modified the kernel with a new parameter, we have to pass that parameter on invocation:

    kernel<<<blockID,dimBlock>>>(dev_a,dev_b,dev_c, width);

That should be what is needed to logically extend the code you have shown to handle "arbitrary" dimensions. I would also suggest adding proper cuda error checking any time you are having trouble with a CUDA code.

But as I defined width using #define, the parameter "width" I don't think need to be passed. — mayview, Jun 19 '15 at 15:34
That is correct. It doesn't need to be passed as a kernel parameter. The more general approach I have shown will work whether `width` is a compiler macro or is an ordinary program variable (although I should have called it something else - I've edited the post to rename it.) — Robert Crovella, Jun 19 '15 at 15:36

how to generalize square matrix multiplication to handle arbitrary dimensions

1 Answers1