
I am trying to write a program for matrix calculations using C/CUDA. I have the following program:

In main.cu

#include <cuda.h>
#include <iostream>
#include "teste.cuh"
using std::cout;

int main(void)
{
 const int Ndofs = 2;
 const int Nel   = 4;
 double *Gh   = new double[Ndofs*Nel*Ndofs*Nel];
 double *Gg;
 cudaMalloc((void**)& Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel);
 for (int ii = 0; ii < Ndofs*Nel*Ndofs*Nel; ii++)
  Gh[ii] = 0.;
 cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice);
 integraG<<<256, 256>>>(Nel, Gg);
 cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost);
 for (int ii = 0; ii < Ndofs*Nel*Ndofs*Nel; ii++)
  cout << ii  + 1 << " " << Gh[ii] << "\n";
 return 0;
}

In teste.cuh

#ifndef TESTE_CUH_
#define TESTE_CUH_

__global__ void integraG(const int N, double* G)
{

    const int szmodel = 2*N;
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    int idy = threadIdx.y + blockIdx.y*blockDim.y;
    int offset = idx + idy*blockDim.x*gridDim.x;
    int posInit = szmodel*offset;

    G[posInit + 0] = 1;
    G[posInit + 1] = 1;
    G[posInit + 2] = 1;
    G[posInit + 3] = 1;
}

#endif

The result (which is supposed to be a matrix filled with 1's) is copied back to the host array. The problem is: nothing happens! Apparently, my program is not calling the GPU kernel, and I am still getting an array full of zeros.

I am very new to CUDA programming and I am using CUDA by Example (Jason Sanders) as a reference book.

My questions are:

  1. What is wrong with my code?
  2. Is this the best way to deal with matrices on the GPU, using matrices in vectorized form?
  3. Is there another reference that provides more examples on matrices with GPUs?
Gabs
    It is going to be very hard to help you if you don't ask a question and don't show us a short complete example of the code causing the problem. Are you sure you have a working CUDA installation? – talonmies Jul 23 '15 at 14:56
  • 2
    Those are tags, not questions! – Klaus Jul 23 '15 at 14:57
  • The questions didn't appear in the post. Sorry. Just edited it. – Gabs Jul 23 '15 at 16:56
  • Just edited it.... Now there are three questions + a code! – Gabs Jul 23 '15 at 17:04
  • your code does not compile, post a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). In addition to that, add [proper CUDA error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api/14038590#14038590). also be aware of the order of the parameters of [`cuMemcpy`](http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDA__MEM_g8d0ff510f26d4b87bd3a51e731e7f698.html). – m.s. Jul 23 '15 at 18:17
  • It does compile... Just added the headers and the "return 0". – Gabs Jul 23 '15 at 18:41

1 Answer


These are your questions:

What is wrong with my code?

Is this the best way to deal with matrices on the GPU, using matrices in vectorized form?

Is there another reference that provides more examples on matrices with GPUs?

For your first question: first of all, your problem should be defined explicitly. What do you want to do with this code? What sort of calculations do you want to perform on the matrix?

Try to check for errors properly; [this](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api/14038590#14038590) is a very good way to do so. There are also some obvious bugs in your code. Some of them:
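A minimal sketch of that kind of wrapper, following the macro from the linked answer (the `gpuErrchk` name comes from there; adapt it as you like):

 #include <cstdio>
 #include <cstdlib>

 #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
 inline void gpuAssert(cudaError_t code, const char *file, int line)
 {
     if (code != cudaSuccess)
     {
         // report the failing call's error string and location, then bail out
         fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
         exit(code);
     }
 }

Wrap every runtime call in it, and check kernel launches separately, since a kernel cannot return an error directly:

 gpuErrchk(cudaMalloc((void**)&Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel));
 integraG<<<block, thread>>>(Nel, Gg);
 gpuErrchk(cudaPeekAtLastError());   // catches launch-configuration errors
 gpuErrchk(cudaDeviceSynchronize()); // catches errors during kernel execution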

  1. You're passing the wrong pointers to cudaMemcpy: the destination pointer comes first and the source second, so the two arguments have to be swapped with each other in both of your calls. Check here.

Change them to:
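 cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice); // host -> device
 cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost); // device -> host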

  1. "NdofsNelNdofs*Nel" shows that you're interested in the value of the first 64 numbers of the array, so why calling 256 Threads and 256 blocks?

  3. This part of your code:

    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    int idy = threadIdx.y + blockIdx.y*blockDim.y;

shows that you want to use 2-D threads and blocks; to do so, you need to use the dim3 type.

By making the following changes:

 cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice); //HERE
 dim3 block(2,2); //HERE
 dim3 thread(4,4); //HERE
 integraG<<<block, thread>>>(Nel, Gg); //HERE
 cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost); //HERE

You'll get a result like the following:

1 1
2 1
3 1
4 1
5 0
6 0
7 0
8 0
9 1
10 1
11 1
12 1
.
.
.
57 1
58 1
59 1
60 1
61 0
62 0
63 0
64 0

Anyway, if you state your problem and goal more clearly, better suggestions can be provided for you.

Regarding your last two questions:

In my opinion, the CUDA C PROGRAMMING GUIDE and the CUDA C BEST PRACTICES GUIDE are the two must-read documents when starting with CUDA, and they include examples on matrix calculations as well.
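As a taste of what those guides cover, here is a minimal sketch of a 2-D kernel over a matrix stored in vectorized form, in the spirit of the programming guide's matrix-addition example (MatAdd, N, and the device pointers dA/dB/dC are illustrative names, not code from your question):

 #define N 64 // illustrative matrix dimension

 __global__ void MatAdd(const double* A, const double* B, double* C)
 {
     int col = threadIdx.x + blockIdx.x*blockDim.x;
     int row = threadIdx.y + blockIdx.y*blockDim.y;
     if (row < N && col < N) // guard threads that fall outside the matrix
         C[row*N + col] = A[row*N + col] + B[row*N + col];
 }

 // one thread per element, rounding the grid up to cover the whole matrix
 dim3 thread(16, 16);
 dim3 block((N + thread.x - 1)/thread.x, (N + thread.y - 1)/thread.y);
 MatAdd<<<block, thread>>>(dA, dB, dC);

The bounds check is what makes the rounded-up launch safe; without it, threads past the edge of the matrix would write out of bounds.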

Iman
  • Thank you! I am new to this kind of programming. Just another question: How do I update my iterators so I can get the maximum performance of the GPU? I mean, for one dimensional array, I should do _idx += blockDim.x*gridDim.x;_ In 2D arrays I must do the same? – Gabs Jul 24 '15 at 11:29
  • Using 2D/3D threads/blocks is a way of representation of the thread and block indexes. For your question, for example If you are using 10 Blocks and 100 threads (totally 1000 threads) on x dimension, and want to perform vector addition on 2000 elements then one iteration is required and the index of the threads have to increase by blockDim.x*gridDim.x (the same thing applies to the y and z dimension). The question is couldn't you initiate 2000 threads from the beginning to avoid this serialization in your Kernel? Read those documents and know your GPU to provide you some insights on this. – Iman Jul 24 '15 at 13:45
  • In fact I would like to make a more generic code in which any matrix can be built, regardless of size. I believe that the number of threads and blocks should be determined as functions of the dimensions of these matrices, am I correct? I wonder how exactly I can do this. – Gabs Jul 27 '15 at 18:10
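A minimal sketch of the two ideas in this exchange (the names fillOnes, rows, cols, and dG are illustrative assumptions, not code from the question): a 2-D grid-stride loop, with the launch configuration derived from the matrix dimensions so the same kernel works for any size:

 __global__ void fillOnes(int rows, int cols, double* G)
 {
     // 2-D grid-stride loop: each thread starts at its global index and
     // advances by the total number of threads in each dimension
     for (int r = threadIdx.y + blockIdx.y*blockDim.y; r < rows; r += blockDim.y*gridDim.y)
         for (int c = threadIdx.x + blockIdx.x*blockDim.x; c < cols; c += blockDim.x*gridDim.x)
             G[r*cols + c] = 1.0;
 }

 // derive the grid from the matrix size (rounding up), so any matrix is covered
 dim3 thread(16, 16);
 dim3 block((cols + thread.x - 1)/thread.x, (rows + thread.y - 1)/thread.y);
 fillOnes<<<block, thread>>>(rows, cols, dG);

With the stride loops in place the kernel also stays correct for a smaller launch; with the rounded-up grid, each thread usually handles exactly one element.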