So there are several problems with this code. In no particular order:
- You are indexing through your various arrays from 1 to 4, but this is not correct in C. C indexing starts at zero and goes to one less than the dimension. This has nothing to do with CUDA.
- `cudaMemcpy2D` expects two pointers (`src` and `dst`), both of which are pointers to linear arrays in memory. I realize this is confusing since "2D" appears all over the description, but the two pointer parameters are fundamentally of the same type (a pointer to memory), and you are passing pointers of 2 different types (one is a pointer to memory, the other is a pointer to a pointer to memory). So clearly, from the definition of `cudaMemcpy2D`, your usage cannot be correct. (The annotated prototype after this list spells out what each parameter means.) There are plenty of answered questions with examples of how to use `cudaMemcpy2D`; I suggest you search and review some of them. Note that fixing this issue will probably cause you to fundamentally rethink how you want to store the data in your host matrices. There are plenty of questions such as this one about handling multidimensional matrices -- you should flatten them if possible. Note that in your current code, this error in `cudaMemcpy2D` usage is destroying the pointer array of your host matrices, which results in a seg fault when you try to print the results.
- Your parameters passed to `cudaMallocPitch` are not quite right. For both the `width` and `height` parameters you are passing `siz`, which is the matrix dimension in bytes. But you should only pass the byte dimension for the `width` parameter. For the `height` parameter you should pass the number of rows, i.e. 4 in your case. There is a similar requirement on the call to `cudaMemcpy2D`, but you got it right there.
- Now let's look at your kernel. In the invocation, you are launching a grid of one block of 16 threads. Since your matrices have 16 elements, that seems sensible. It implies a thread strategy where each thread is responsible for a single element of the result. But looking at your kernel code, you have each thread computing the result of an entire row, i.e. 4 elements. There are 2 ways to fix this: you could either reduce your grid to 4 threads instead of 16 (probably simpler from a code-modification standpoint), or you could rewrite your kernel (eliminating the for-loop) so that each thread computes a single output element (which does more work in parallel). The code below takes the first approach; a sketch of the second follows it.
- Additionally, in your kernel, you are using the `pitch` parameter in pointer-arithmetic-based indexing. But remember that pitch is in bytes, and for pointer arithmetic the compiler expects offsets in elements; it does the conversion to bytes for you, based on the data type. Again, this is really a C issue, not specific to CUDA. You could fix this by using `(pitch/sizeof(int))` wherever you are using `pitch` in the kernel.
- You are passing `siz` for the pitch to your kernel. You should be passing `pitch` for the pitch parameter. `siz` is effectively the "pitch" of the host data storage, but `pitch` is the pitch of the storage on the device. The kernel is operating on the device storage, so it needs the device pitch.
- As a recommendation, do CUDA error checking on all CUDA API calls and kernel calls.
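For reference, here is the `cudaMemcpy2D` prototype from the CUDA Runtime API, annotated, plus the row-addressing pattern the programming guide documents for pitched allocations (byte arithmetic through a `char *` cast, which is equivalent to the `pitch/sizeof(int)` scaling used below). The helper name `row_ptr` is just for illustration:

```c
// cudaError_t cudaMemcpy2D(void *dst, size_t dpitch,        // destination pointer and pitch (bytes)
//                          const void *src, size_t spitch,  // source pointer and pitch (bytes)
//                          size_t width,                    // bytes to copy per row
//                          size_t height,                   // number of rows
//                          cudaMemcpyKind kind);

// Hypothetical helper: address row r of a pitched device allocation.
// pitch is in bytes, so offset a char * and cast back to int *.
__device__ int *row_ptr(int *base, size_t pitch, int r)
{
    return (int *)((char *)base + r * pitch);
}
```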
Here is some code which addresses all of the above issues, in one fashion or another:
```c
#include <stdio.h>
#include <stdlib.h>   // for malloc() and rand()

#define siz (4*sizeof(int))

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

// One thread per row: thread r adds row r of m2 into row r of m1.
// pitch arrives in bytes, so scale it to int elements before the
// pointer arithmetic.
__global__ void addmatrix(int *m1, int *m2, size_t pitch)
{
    int r = threadIdx.x;
    int *r1 = m1 + r*(pitch/sizeof(int));
    int *r2 = m2 + r*(pitch/sizeof(int));
    int c;
    for (c = 0; c < 4; c++)
        r1[c] += r2[c];
}

int main()
{
    int i, j;
    int *m1_c, *m2_c;   // host matrices, flattened to 1-D arrays
    int *m1_d, *m2_d;   // device matrices, pitched allocations
    size_t pitch;

    m1_c = (int *)malloc(16*sizeof(int));
    m2_c = (int *)malloc(16*sizeof(int));
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++) {
            m1_c[(i*4)+j] = rand()%10;
            m2_c[(i*4)+j] = rand()%10;
        }

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++)
            printf("%d\t", m1_c[(i*4)+j]);
        printf("\n");
    }
    printf("\n\n");
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++)
            printf("%d\t", m2_c[(i*4)+j]);
        printf("\n");
    }

    // width is in bytes (siz); height is the number of rows (4)
    cudaMallocPitch((void **)&m1_d, &pitch, siz, 4);
    cudaCheckErrors("cm1");
    cudaMallocPitch((void **)&m2_d, &pitch, siz, 4);
    cudaCheckErrors("cm2");
    // host pitch is siz; device pitch is whatever cudaMallocPitch chose
    cudaMemcpy2D(m1_d, pitch, m1_c, siz, siz, 4, cudaMemcpyHostToDevice);
    cudaCheckErrors("cm3");
    cudaMemcpy2D(m2_d, pitch, m2_c, siz, siz, 4, cudaMemcpyHostToDevice);
    cudaCheckErrors("cm4");

    dim3 grid(1);
    dim3 block(4);   // one thread per row
    addmatrix<<<grid,block>>>(m1_d, m2_d, pitch);  // pass the device pitch
    cudaCheckErrors("kernel");

    cudaMemcpy2D(m1_c, siz, m1_d, pitch, siz, 4, cudaMemcpyDeviceToHost);
    cudaCheckErrors("cm5");

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++)
            printf("%d\t", m1_c[(i*4)+j]);
        printf("\n");
    }

    cudaFree(m1_d);
    cudaFree(m2_d);
    cudaDeviceReset();
    return 0;
}
```
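As mentioned in the kernel discussion above, the alternative strategy is to go back to 16 threads and have each thread compute a single output element. Here is a minimal sketch of that kernel, using the `char *` form of pitched addressing; the name `addmatrix_elem` and the hard-coded 4x4 shape are illustrative only. You would launch it as `addmatrix_elem<<<1,16>>>(m1_d,m2_d,pitch);` in place of the 4-thread launch above:

```c
// One thread per element: thread t handles row t/4, column t%4 of a 4x4 matrix.
__global__ void addmatrix_elem(int *m1, int *m2, size_t pitch)
{
    int t = threadIdx.x;                        // 0..15
    int r = t / 4;                              // row index
    int c = t % 4;                              // column index
    int *r1 = (int *)((char *)m1 + r * pitch);  // pitch is in bytes
    int *r2 = (int *)((char *)m2 + r * pitch);
    r1[c] += r2[c];
}
```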