So there are several problems with this code. In no particular order:
- You are indexing through your various arrays from 1 to 4, but this is not correct in C. C indexing starts at zero and goes to one less than the dimension. This has nothing to do with CUDA.
- `cudaMemcpy2D` expects two pointers (`src` and `dst`), both of which are pointers to linear arrays in memory. I realize this is confusing since "2D" appears all over the description, but the two pointer parameters are fundamentally of the same type (a pointer to memory), and you are passing pointers of 2 different types (one is a pointer to memory, the other is a pointer to a pointer to memory). So clearly, from the definition of `cudaMemcpy2D`, your usage cannot be correct. (The annotated prototype after this list spells out what each parameter means.) There are plenty of answered questions with examples of how to use `cudaMemcpy2D`; I suggest you search and review some of them. Note that fixing this issue will probably cause you to fundamentally rethink how you want to store the data in your host matrices. There are plenty of questions such as this one about handling multidimensional matrices -- you should flatten them if possible. Note that in your current code, this error in `cudaMemcpy2D` usage is destroying the pointer array of your host matrices, which results in a seg fault when you try to print the results.
- Your parameters passed to `cudaMallocPitch` are not quite right. For both the `width` and `height` parameters you are passing `siz`, which is the matrix dimension in bytes. But you should only pass the byte dimension for the `width` parameter. For the `height` parameter you should pass the number of rows, i.e. 4 in your case. There is a similar requirement on the call to `cudaMemcpy2D`, but you got it right there.
- Now let's look at your kernel. In the invocation, you are launching a grid of one block of 16 threads. Since your matrices have 16 elements, that seems sensible. It implies a thread strategy where each thread is responsible for a single element of the result. But looking at your kernel code, you have each thread computing the result of an entire row, i.e. 4 elements. There are 2 ways to fix this: you could either reduce your grid to 4 threads instead of 16 (probably simpler from a code-modification standpoint), or you could rewrite your kernel (eliminating the for-loop) so that each thread computes a single output element (which does more work in parallel). The code below takes the first approach; a sketch of the second follows it.
- Additionally, in your kernel, you are using the `pitch` parameter in pointer-arithmetic-based indexing. But remember that pitch is in bytes, and for pointer arithmetic the compiler expects offsets in elements; it does the conversion to bytes for you, based on the data type. Again, this is really a C issue, not specific to CUDA. You could fix this by using `(pitch/sizeof(int))` wherever you are using `pitch` in the kernel.
- You are passing `siz` for the pitch to your kernel. You should be passing `pitch` for the pitch parameter. `siz` is effectively the "pitch" of the host data storage, but `pitch` is the pitch of the storage on the device. The kernel is operating on the device storage, so it needs the device pitch.
- As a recommendation, do CUDA error checking on all CUDA API calls and kernel calls.
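For reference, here is the `cudaMemcpy2D` prototype from the CUDA Runtime API, annotated, plus the row-addressing pattern the programming guide documents for pitched allocations (byte arithmetic through a `char *` cast, which is equivalent to the `pitch/sizeof(int)` scaling used below). The helper name `row_ptr` is just for illustration:

```c
// cudaError_t cudaMemcpy2D(void *dst, size_t dpitch,        // destination pointer and pitch (bytes)
//                          const void *src, size_t spitch,  // source pointer and pitch (bytes)
//                          size_t width,                    // bytes to copy per row
//                          size_t height,                   // number of rows
//                          cudaMemcpyKind kind);

// Hypothetical helper: address row r of a pitched device allocation.
// pitch is in bytes, so offset a char * and cast back to int *.
__device__ int *row_ptr(int *base, size_t pitch, int r)
{
    return (int *)((char *)base + r * pitch);
}
```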
Here is some code which addresses all of the above issues, in one fashion or another:
```c
#include <stdio.h>
#include <stdlib.h>   // for malloc() and rand()

#define siz (4*sizeof(int))

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

// One thread per row: thread r adds row r of m2 into row r of m1.
// pitch arrives in bytes, so scale it to int elements before the
// pointer arithmetic.
__global__ void addmatrix(int *m1, int *m2, size_t pitch)
{
    int r = threadIdx.x;
    int *r1 = m1 + r*(pitch/sizeof(int));
    int *r2 = m2 + r*(pitch/sizeof(int));
    int c;
    for (c = 0; c < 4; c++)
        r1[c] += r2[c];
}

int main()
{
    int i, j;
    int *m1_c, *m2_c;   // host matrices, flattened to 1-D arrays
    int *m1_d, *m2_d;   // device matrices, pitched allocations
    size_t pitch;

    m1_c = (int *)malloc(16*sizeof(int));
    m2_c = (int *)malloc(16*sizeof(int));
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++) {
            m1_c[(i*4)+j] = rand()%10;
            m2_c[(i*4)+j] = rand()%10;
        }

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++)
            printf("%d\t", m1_c[(i*4)+j]);
        printf("\n");
    }
    printf("\n\n");
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++)
            printf("%d\t", m2_c[(i*4)+j]);
        printf("\n");
    }

    // width is in bytes (siz); height is the number of rows (4)
    cudaMallocPitch((void **)&m1_d, &pitch, siz, 4);
    cudaCheckErrors("cm1");
    cudaMallocPitch((void **)&m2_d, &pitch, siz, 4);
    cudaCheckErrors("cm2");
    // host pitch is siz; device pitch is whatever cudaMallocPitch chose
    cudaMemcpy2D(m1_d, pitch, m1_c, siz, siz, 4, cudaMemcpyHostToDevice);
    cudaCheckErrors("cm3");
    cudaMemcpy2D(m2_d, pitch, m2_c, siz, siz, 4, cudaMemcpyHostToDevice);
    cudaCheckErrors("cm4");

    dim3 grid(1);
    dim3 block(4);   // one thread per row
    addmatrix<<<grid,block>>>(m1_d, m2_d, pitch);  // pass the device pitch
    cudaCheckErrors("kernel");

    cudaMemcpy2D(m1_c, siz, m1_d, pitch, siz, 4, cudaMemcpyDeviceToHost);
    cudaCheckErrors("cm5");

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++)
            printf("%d\t", m1_c[(i*4)+j]);
        printf("\n");
    }

    cudaFree(m1_d);
    cudaFree(m2_d);
    cudaDeviceReset();
    return 0;
}
```
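As mentioned in the kernel discussion above, the alternative strategy is to go back to 16 threads and have each thread compute a single output element. Here is a minimal sketch of that kernel, using the `char *` form of pitched addressing; the name `addmatrix_elem` and the hard-coded 4x4 shape are illustrative only. You would launch it as `addmatrix_elem<<<1,16>>>(m1_d,m2_d,pitch);` in place of the 4-thread launch above:

```c
// One thread per element: thread t handles row t/4, column t%4 of a 4x4 matrix.
__global__ void addmatrix_elem(int *m1, int *m2, size_t pitch)
{
    int t = threadIdx.x;                        // 0..15
    int r = t / 4;                              // row index
    int c = t % 4;                              // column index
    int *r1 = (int *)((char *)m1 + r * pitch);  // pitch is in bytes
    int *r2 = (int *)((char *)m2 + r * pitch);
    r1[c] += r2[c];
}
```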