Matrix multiplication using CUDA: Wrong output

Question

I'm a beginner in CUDA. I am writing a program to multiply two matrices without using shared memory. Here's my program, in which I multiply two 4x4 matrices filled with 1s.

The output is 26853932, whereas the correct output should be 4.

Can someone please tell me where I went wrong? Maybe I've made a very naive mistake?

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include<stdlib.h>  

//kernel definition

__global__ void mulKernel(int *d_M,  int *d_N,  int *d_P,int width)
{
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;


if (row < width && col < width)
{
    int pvalue=0;
    for (int k = 0; k < width; k++)
    {
        pvalue = pvalue + (d_M[row*width + k] * d_N[k*width + col]);
    }

    d_P[row*width + col] = pvalue;
}
}

int main()
{
const int block_size = 2;
const int array_width = 4;
int h_M[array_width][array_width] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };

int h_N[array_width][array_width] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
int h_P[array_width][array_width];

int size = array_width*array_width*(sizeof(int));
int *d_M, *d_N, *d_P;

//memory allocation
cudaMalloc((void**)&d_M, size);
cudaMalloc((void**)&d_N, size);
cudaMalloc((void**)&d_P, size); 

//copy data from host to device memory
cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_N, h_N, size, cudaMemcpyHostToDevice);

dim3 grid(array_width/block_size, array_width/block_size, 0);       //grid size
dim3 block(block_size, block_size, 0);  //block size

mulKernel << <grid, block >> >(d_M,d_N,d_P,array_width);

cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);

printf("%d", h_P[0][0]);

printf("Press enter to exit....\n");
getchar();


}

The error is that you set the z-extent of your block and grid to `0`. A 2D grid is a 3D grid where one dimension is of length 1. If you leave out the last argument to `dim3` it is automatically set to `1`. Thus, just erase `,0` and your code works. For future posts: It is not necessary to include a completely irrelevant picture of the output. Instead write the wrong output **and** the expected output in text. — havogt, Mar 22 '16 at 10:10
@havogt if you want to write an answer I would upvote. The use of `0` as you indicate would actually throw a CUDA runtime error, which the posted code is not checking for. — Robert Crovella, Mar 22 '16 at 13:25

score 2 · Answer 1 · edited May 23 '17 at 12:15

The problem is in the lines

dim3 grid(array_width/block_size, array_width/block_size, 0);       //grid size
dim3 block(block_size, block_size, 0);  //block size

where the z-extent of both the grid and the block is set to 0. The correct representation of a 2D object in a 3D grid is to set the extent in the unused direction to 1.

The code works after replacing the 0 with 1 or by leaving out the 3rd argument (then it is default initialized to 1):

dim3 grid(array_width/block_size, array_width/block_size );
dim3 block(block_size, block_size );
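To see the default in action, here is a minimal host-only sketch. It reuses `block_size` and `array_width` from the question and only prints the three components of each `dim3`; no kernel is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int block_size = 2;
    const int array_width = 4;

    // Third argument omitted: dim3 sets any unspecified component to 1.
    dim3 grid(array_width / block_size, array_width / block_size);
    dim3 block(block_size, block_size);

    // The components of dim3 are unsigned int, hence %u.
    printf("grid  = (%u, %u, %u)\n", grid.x, grid.y, grid.z);
    printf("block = (%u, %u, %u)\n", block.x, block.y, block.z);
    return 0;
}
```

Both lines should show a z component of 1, which is what a 2D launch needs.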

With the incorrect setup, the runtime error `invalid configuration argument` is reported for the kernel launch. You could have easily found that yourself by using proper CUDA error checking (or by running your program with `cuda-memcheck`).
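As a rough sketch of what such a check can look like: a kernel launch returns no value, so the launch error has to be fetched with `cudaGetLastError()` afterwards. The kernel name `emptyKernel` below is a placeholder invented for this illustration; the zero z-extents mirror the question's configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical, for illustration only).
__global__ void emptyKernel() {}

int main()
{
    // Same mistake as in the question: z-extent of 0.
    dim3 grid(2, 2, 0);
    dim3 block(2, 2, 0);

    emptyKernel<<<grid, block>>>();

    // Launch-configuration errors are reported here, right after the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Errors that occur while the kernel runs surface at the next synchronizing call.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("Kernel execution failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Kernel ran without errors\n");
    return 0;
}
```

With the zero extents left in, the first check should fire and print the configuration error instead of letting the program continue with an uninitialized result buffer.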

Thanks man, how naive on my part to make such a silly mistake. Upvoted! — Prashant Pandey, Mar 22 '16 at 19:00

yogesh_desai · Answer 2 · 2016-03-22T16:33:40.097

The initialization of GRID & BLOCK configuration is wrong.

Now:

dim3 grid(array_width/block_size, array_width/block_size, 0);       //grid size
dim3 block(block_size, block_size, 0);  //block size

Expected:

dim3 grid(array_width/block_size, array_width/block_size, 1);       //grid size
dim3 block(block_size, block_size, 1);  //block size

It's a good habit to check CUDA calls for errors. The following is a very simple example; just replace `stmt` with the CUDA API call you want to check.

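// Every line of a multi-line macro except the last must end with a backslash.
#define wbCheck(stmt) do {                                          \
    cudaError_t err = (stmt);                                       \
    if (err != cudaSuccess) {                                       \
        printf("Failed to run stmt at line %d\n", __LINE__);        \
        printf("Got CUDA error ...  %s\n", cudaGetErrorString(err)); \
        return -1;                                                  \
    }                                                               \
} while(0)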
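A possible way to use such a macro is sketched below. The macro is repeated so the snippet compiles on its own, and the buffer names (`h_M`, `d_M`) are borrowed from the question purely for illustration. Because the macro executes `return -1`, it can only be used inside functions that return an `int`, such as `main`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define wbCheck(stmt) do {                                          \
    cudaError_t err = (stmt);                                       \
    if (err != cudaSuccess) {                                       \
        printf("Failed to run stmt at line %d\n", __LINE__);        \
        printf("Got CUDA error ...  %s\n", cudaGetErrorString(err)); \
        return -1;                                                  \
    }                                                               \
} while(0)

int main()
{
    const int array_width = 4;
    const int size = array_width * array_width * sizeof(int);

    int h_M[array_width * array_width];
    for (int i = 0; i < array_width * array_width; i++)
        h_M[i] = 1;

    int *d_M = NULL;

    // Wrap each runtime API call that returns a cudaError_t.
    wbCheck(cudaMalloc((void**)&d_M, size));
    wbCheck(cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice));

    // A kernel launch has no return value; after a launch,
    // wrap cudaGetLastError() instead.
    wbCheck(cudaGetLastError());

    wbCheck(cudaFree(d_M));
    return 0;
}
```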
