This question may be useful for background.
Perhaps you don't know what a pitched allocation is. A pitched allocation looks like this:
X X X P P P
X X X P P P
X X X P P P
The above could represent storage for a 3x3 array (elements represented by X
) that is pitched (pitched value of 6 elements, pitch "elements" represented by P
).
You'll have no luck accessing such a storage arrangement if you don't follow the guidelines given in the reference manual for cudaMallocPitch
. In-kernel access to such a pitched allocation should be done as follows:
T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
You'll note that the above formula depends on the pitch
value that was provided at the point of cudaMallocPitch
. If you don't pass that value to your kernel, you won't have any luck with this.
Because you are not doing that, the proximal reason for your observation:
the code will print (1 2 3 0 0 0 0 0 0)
is because your indexing is reading just the first "row" of that pitched allocation, and the P
elements are showing up as zero (although that's not guaranteed.)
We can fix your code simply by implementing the suggestions given in the reference manual:
$ cat t2153.cu
#include <cstdio>
const size_t N = 3;
__global__ void kernal_print(double* d_A, size_t my_N, size_t pitch){
// int xIdx = threadIdx.x + blockDim.x * blockIdx.x;
// int yIdx = threadIdx.y + blockDim.y * blockIdx.y;
printf("\n");
for(int row = 0; row < my_N; row++)
for (int col = 0; col < my_N; col++){
double* pElement = (double *)((char*)d_A + row * pitch) + col;
printf("%f, ",*pElement);
}
printf("\n");
}
void function(){
double A[N][N];
for (size_t row = 0; row < N; row++)
for (size_t col = 0; col < N; col++)
A[row][col] = row*N+col+1;
double* d_A;
size_t pitch;
cudaMallocPitch(&d_A, &pitch, N * sizeof(double), N);
cudaMemcpy2D(d_A, pitch, A, N * sizeof(double) , N * sizeof(double), N, cudaMemcpyHostToDevice);
int threadnum = 1;
int blocknum = 1;
kernal_print<<<blocknum, threadnum>>>(d_A, N, pitch);
cudaDeviceSynchronize();
}
int main(){
function();
}
$ nvcc -o t2153 t2153.cu
$ compute-sanitizer ./t2153
========= COMPUTE-SANITIZER
1.000000, 2.000000, 3.000000, 4.000000, 5.000000, 6.000000, 7.000000, 8.000000, 9.000000,
========= ERROR SUMMARY: 0 errors
$
A few comments:
- The usage of the term 2D can have varied interpretations.
- Using a pitched allocation is not necessary for 2D work, and it may also have no practical value (not making your code simpler or more performant).
- For further discussion of the varied ways of doing "2D work", please read the answer I linked.
- This sort of allocation:
double A[N][N];
may give you trouble for large N
, because it is a stack-based allocation. Instead, use a dynamic allocation (which may affect a number of the methods you use to handle it.) There are various questions covering this, such as this one.