I am trying to write a very simple program to perform matrix addition. I divided the code into two files, a main.cu file and a Matriz.cuh header file. The code is:

At main.cu:

#include <iostream>
#include <cuda.h>

#include "Matriz.cuh"

using std::cout;

int main(void)
{

    Matriz A;
    Matriz B;
    Matriz *C = new Matriz;
    int lin = 10;
    int col = 10;

    A.lin = lin;
    A.col = col;
    B.lin = lin;
    B.col = col;
    C->lin = lin;
    C->col = col;
    C->matriz = new double[lin*col];

    A.matriz = new double[lin*col];
    B.matriz = new double[lin*col];

    for (int ii = 0; ii < lin; ii++)
        for (int jj = 0; jj < col; jj++)
        {
            A.matriz[jj*A.lin + ii] = 1./(float)(10.*jj + ii + 10.0);
            B.matriz[jj*B.lin + ii] = (float)(jj + ii + 1);
        }

    somaMatriz(A, B, C);

    for (int ii = 0; ii < lin; ii++)
    {
        for (int jj = 0; jj < col; jj++)
            cout << C->matriz[jj*C->lin + jj] << " ";
        cout << "\n";
    }

    return 0;

}

At Matriz.cuh:

#include <cuda.h>
#include <iostream>
using std::cout;

#ifndef MATRIZ_CUH_
#define MATRIZ_CUH_

typedef struct{
    double *matriz;
    int    lin;
    int    col;
} Matriz;

__global__ void addMatrix(const Matriz A, const Matriz B, Matriz C)
{
    int idx = threadIdx.x + blockDim.x*gridDim.x;
    int idy = threadIdx.y + blockDim.y*gridDim.y;

    C.matriz[C.lin*idy + idx] = A.matriz[A.lin*idx + idy] + B.matriz[B.lin*idx + idy];
}

void somaMatriz(const Matriz A, const Matriz B, Matriz *C)
{
    Matriz dA;
    Matriz dB;
    Matriz dC;

    int BLOCK_SIZE = A.lin;

    dA.lin = A.lin;
    dA.col = A.col;
    dB.lin = B.lin;
    dB.col = B.col;
    dC.lin = C->lin;
    dC.col = C->col;

    cudaMalloc((void**)&dA.matriz, dA.lin*dA.col*sizeof(double));
    cudaMalloc((void**)&dB.matriz, dB.lin*dB.col*sizeof(double));
    cudaMalloc((void**)&dC.matriz, dC.lin*dC.col*sizeof(double));

    cudaMemcpy(dA.matriz, A.matriz, dA.lin*dA.col*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB.matriz, B.matriz, dB.lin*dB.col*sizeof(double), cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(dA.lin/dimBlock.x, dA.col/dimBlock.y);

    addMatrix<<<dimGrid, dimBlock>>>(dA, dB, dC);

    cudaMemcpy(C->matriz, dC.matriz, dC.lin*dC.col*sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dA.matriz);
    cudaFree(dB.matriz);
    cudaFree(dC.matriz);

   return;
}

#endif /* MATRIZ_CUH_ */

What I am getting: matrix C is filled with ones, no matter what I do. I am using this program to get an idea of how to work with variable-size matrices in a GPU program. What is wrong with my code?

  • In printing C you have `cout << C->matriz[jj*C->lin + jj] << " ";` (both jj); you probably wanted `cout << C->matriz[jj*C->lin + ii] << " "`. And in `addMatrix`, for A and B you have `lin*idx` while C has `lin*idy`, but I suspect they should be the same. – ryanpattison Jul 27 '15 at 17:50
  • Ok... I changed those lines, but I am still getting some weird results: I am trying with 10x10 matrices, and the resulting C is a matrix full of ones, except for the last two rows and columns, where there is a 2x2 submatrix of real numbers. – Gabs Jul 27 '15 at 18:05

1 Answer

Any time you're having trouble with a CUDA code, it's good practice to do proper CUDA error checking and to run your code with cuda-memcheck. When I run your code with cuda-memcheck, I get an indication that the kernel is performing out-of-bounds read operations. Since your kernel is trivially simple, that means your indexing calculations must be incorrect.
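
For reference, a minimal error-checking sketch might look like the following (the gpuErrchk/gpuAssert names here are just an illustration, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal sketch: wrap every CUDA runtime call in gpuErrchk(...)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

Since a kernel launch itself returns no status, check cudaGetLastError() right after the launch for configuration errors, and cudaDeviceSynchronize() for errors raised while the kernel runs.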

Your program needs at least 2 changes to get it working for small square matrices:

  1. The index calculations in the kernel for A, B, and C should all be the same:

    C.matriz[C.lin*idy + idx] = A.matriz[A.lin*idx + idy] + B.matriz[B.lin*idx + idy];
    

    like this:

    C.matriz[C.lin*idy + idx] = A.matriz[A.lin*idy + idx] + B.matriz[B.lin*idy + idx];
    
  2. Your x/y index creation in the kernel is not correct:

    int idx = threadIdx.x + blockDim.x*gridDim.x;
    int idy = threadIdx.y + blockDim.y*gridDim.y;
    

    they should be:

    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    int idy = threadIdx.y + blockDim.y*blockIdx.y;
    

With the above changes, I was able to get rational-looking output.
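
For reference, here is the kernel with just those two changes applied:

__global__ void addMatrix(const Matriz A, const Matriz B, Matriz C)
{
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    int idy = threadIdx.y + blockDim.y*blockIdx.y;

    C.matriz[C.lin*idy + idx] = A.matriz[A.lin*idy + idx] + B.matriz[B.lin*idy + idx];
}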

Your setup code also doesn't appear to be handling larger matrices correctly:

int BLOCK_SIZE = A.lin;
...
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(dA.lin/dimBlock.x, dA.col/dimBlock.y);

You probably want something like:

int BLOCK_SIZE = 16;
...
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid((dA.lin + dimBlock.x - 1)/dimBlock.x, (dA.col + dimBlock.y -1)/dimBlock.y);
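
Putting that together with the error checking described above, the launch portion of somaMatriz might look something like this sketch:

int BLOCK_SIZE = 16;
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
// round up so that partial blocks cover sizes not divisible by BLOCK_SIZE
dim3 dimGrid((dA.lin + dimBlock.x - 1)/dimBlock.x,
             (dA.col + dimBlock.y - 1)/dimBlock.y);

addMatrix<<<dimGrid, dimBlock>>>(dA, dB, dC);
gpuErrchk(cudaGetLastError());      // catch launch configuration errors
gpuErrchk(cudaDeviceSynchronize()); // catch errors raised during execution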

With those changes, you should add a valid thread check to your kernel, something like this:

__global__ void addMatrix(const Matriz A, const Matriz B, Matriz C)
{
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    int idy = threadIdx.y + blockDim.y*blockIdx.y;

    if ((idx < A.col) && (idy < A.lin))
      C.matriz[C.lin*idy + idx] = A.matriz[A.lin*idy + idx] + B.matriz[B.lin*idy + idx];
}

I also haven't validated that you are comparing all dimensions correctly against the appropriate row (lin) or column (col) limits. That is something else to verify for non-square matrices.
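
For example, with the column-major layout that main.cu uses (element (row, column) stored at matriz[column*lin + row]), a dimension-consistent version of the kernel might look like this sketch:

__global__ void addMatrix(const Matriz A, const Matriz B, Matriz C)
{
    int row = threadIdx.x + blockDim.x*blockIdx.x; // index within a column
    int col = threadIdx.y + blockDim.y*blockIdx.y; // column index

    if ((row < A.lin) && (col < A.col))
        C.matriz[C.lin*col + row] = A.matriz[A.lin*col + row]
                                  + B.matriz[B.lin*col + row];
}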

  • Thank you very much! Actually, I am following the NVIDIA CUDA C Programming Guide, and there is no memory checking in the examples. Also, the idx, idy and dimGrid values were obtained following the examples in the NVIDIA guide (see pages 22 - 25). – Gabs Jul 27 '15 at 18:37
  • If I am dealing with huge matrices, how do I update the indices of the matrices? I guess it is not the same as for one-dimensional arrays. – Gabs Jul 27 '15 at 18:38
  • I'm looking at the CUDA C Programming Guide (PG-02829-001_v7.0) for CUDA 7.0, on p22, and I don't see anything that looks like your code. They generate a linear index without using `gridDim` (which was a key error in your code) and the kernel on page 22 (doing vector add) passes the vector length and checks the thread index variable against that length. The grid size calculation I see on p22 looks like mine, not yours. In general the programming guide does not demonstrate proper error checking for every example, for the sake of brevity of presentation of the concepts. – Robert Crovella Jul 27 '15 at 18:50
  • Regarding your question about calculation for large arrays, the `idx` and `idy` calculations I have shown would not need to change. The entire kernel should be able to handle arbitrarily large dimensions. – Robert Crovella Jul 27 '15 at 18:51
  • At page 25: // Invoke kernel dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y); MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); – Gabs Jul 27 '15 at 18:53
  • OK, so that particular example will only work correctly when the matrix size/dimensions are evenly divisible by `BLOCK_SIZE`. That means your code as written would have the same limitation. The shared memory version of that code becomes considerably more complicated without making this simplification, and the simplification is beside the point that is being communicated there. Is there a question? I've already pointed out how to make adjustments for it. – Robert Crovella Jul 27 '15 at 19:00
  • Looking at CUDA by Example (Jason Sanders), at page 72, he defines the variable `int offset = x + y * blockDim.x * gridDim.x;`. Is it really necessary, or can I deal with large matrices without it? – Gabs Jul 27 '15 at 19:39
  • Is memory checking *really* necessary? (I work with numerical methods, and such low-level procedures sound crazy to me!) Aren't the pointers (idx, idy) automatically set up by the CUDA API during their declaration? – Gabs Jul 27 '15 at 19:42
  • idx and idy are not pointers. According to your code, they are integer variables that are created by you, not automatically by the CUDA API. And depending on what a code is doing, they may need to be created in different fashions. Just because you can find a code somewhere, in some book, that defines a variable called `offset` in a different way doesn't mean it's a suitable method for your code. gridDim variables are typically used in codes that subdivide the work in some fashion, perhaps into tiles or grid-striding loops. Your kernel does neither (and as written it would make no sense). – Robert Crovella Jul 27 '15 at 20:07
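
To illustrate the grid-stride pattern mentioned in the last comment (not something this simple code needs), a 2D grid-stride version of the addition kernel might look like the following sketch; this is where gridDim legitimately appears, because each thread steps through the matrix in strides of the whole grid:

__global__ void addMatrixStrided(const Matriz A, const Matriz B, Matriz C)
{
    // a fixed grid of any size covers a matrix of any size:
    // each thread loops over the elements assigned to it
    for (int col = threadIdx.y + blockDim.y*blockIdx.y; col < A.col;
         col += blockDim.y*gridDim.y)
        for (int row = threadIdx.x + blockDim.x*blockIdx.x; row < A.lin;
             row += blockDim.x*gridDim.x)
            C.matriz[C.lin*col + row] = A.matriz[A.lin*col + row]
                                      + B.matriz[B.lin*col + row];
}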