I am doing a matrix multiplication in CUDA. The following setup works:

int TILE = 8;
dim3 DimGrid((numCColumns - 1)/TILE + 1, (numCRows - 1)/TILE + 1, 1);
dim3 DimBlock(TILE, TILE, 1);

But if I use a single block for the whole image, the result is all zeros. What is the reason for that? Assume one block can hold the whole image (the input is 64x64).

dim3 DimGrid(1,1,1);
dim3 DimBlock(numCColumns, numCRows, 1);

This is how I call the kernel in the main function:

matrixMultiply<<<DimGrid, DimBlock>>>(deviceA, deviceB, deviceC,
                                        numARows, numAColumns,
                                        numBRows, numBColumns,
                                        numCRows, numCColumns);

and the kernel:

__global__ void matrixMultiply(float * A, float * B, float * C,
                   int numARows, int numAColumns,
                   int numBRows, int numBColumns,
                   int numCRows, int numCColumns) {
    //@@ Insert code to implement matrix multiplication here
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    if ((Row < numCRows) && (Col < numCColumns))
    {
        float value = 0.0f;
        for (int i = 0; i < numAColumns; i++)
            value += A[Row * numAColumns + i] * B[i*numBColumns + Col];
        C[Row * numCColumns + Col] = value;
    }
}
Dzung Nguyen

1 Answer

But if I use one block for the whole image, it returns all zero. What is the reason for that?

A CUDA threadblock is limited to a maximum of 1024 threads (refer to "Maximum number of threads per block" in the technical specifications table of the CUDA programming guide). For a multidimensional threadblock, this means the product of the dimensions must be less than or equal to 1024 (for cc 2.x and newer GPUs).

Even for a 64x64 image, this would not work:

dim3 DimBlock(numCColumns, numCRows, 1);

since numCColumns * numCRows is 64 * 64 = 4096, which is greater than 1024.
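The limit can also be read from the device at run time instead of being hard-coded. The following is a minimal sketch, assuming device 0 is the GPU in use; `blockFits` is a hypothetical helper introduced only for this illustration, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if an x-by-y-by-1 block stays within the
// per-block thread limit reported by the device.
bool blockFits(int x, int y, int maxThreadsPerBlock) {
    return (long long)x * y <= maxThreadsPerBlock;
}

int main() {
    cudaDeviceProp prop;
    // Device 0 is an assumption; pick the device your program actually uses.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Total threads allowed in one block, and the per-dimension limits.
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim      = (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);

    // The single-block configuration from the question: 64 x 64 threads.
    int n = 64;
    printf("%dx%d block (%d threads) fits: %s\n", n, n, n * n,
           blockFits(n, n, prop.maxThreadsPerBlock) ? "yes" : "no");
    return 0;
}
```

On a device that reports 1024 threads per block, a 32x32 block is the largest square block that fits, which is why the tiled configuration from the question works and the 64x64 one does not.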

If you do proper CUDA error checking in your code, you'll get an indication of this: the kernel launch fails because of an invalid configuration argument. A kernel that fails to launch never runs, so deviceC is never written.
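One common way to do this kind of error checking is sketched below. The `CUDA_CHECK` macro and `dummyKernel` are hypothetical names made up for this example (they are not part of the CUDA API or the code in the question); the runtime calls themselves (`cudaGetLastError`, `cudaDeviceSynchronize`, `cudaGetErrorString`) are standard.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the error and abort if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in kernel for illustration: each thread writes one element.
__global__ void dummyKernel(float *out) {
    int idx = threadIdx.y * blockDim.x + threadIdx.x;
    out[idx] = 1.0f;
}

int main() {
    const int n = 64;
    float *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, n * n * sizeof(float)));

    dim3 grid(1, 1, 1);
    dim3 block(n, n, 1);  // 64 * 64 = 4096 threads, above the 1024 limit
    dummyKernel<<<grid, block>>>(d_out);

    // Reports launch-time problems such as an invalid block size.
    CUDA_CHECK(cudaGetLastError());
    // Reports errors that occur while the kernel is executing.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

With the oversized block, the check right after the launch should stop the program with a message along the lines of "invalid configuration argument". Changing `block` to something like `dim3 block(16, 16, 1)` with a matching grid lets the launch go through.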

Robert Crovella