
I want to print the global 2D array variable d_t using printf inside main, but I get a compile warning saying:

a __device__ variable "d_t" cannot be directly read in a host function

How can I copy a global 2D array variable from the device to the host and then print the first column of each row?

#include <cstdio>

__device__ double *d_t;

__device__ size_t d_gridPitch;


__global__ void kernelFunc()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    double* rowt = (double*)((char *)d_t + i * d_gridPitch);
    rowt[0] = rowt[0] + 40000;

}


int main()
{
    int size = 16;
    size_t d_pitchLoc;
    double *d_tLoc;

    cudaMallocPitch((void**)&d_tLoc, &d_pitchLoc, size * sizeof(double), size);
    cudaMemset2D(d_tLoc, d_pitchLoc, 0, size * sizeof(double), size);

    cudaMemcpyToSymbol(d_gridPitch, &d_pitchLoc, sizeof(d_pitchLoc));
    cudaMemcpyToSymbol(d_t, & d_tLoc, sizeof(d_tLoc));

    kernelFunc<<<1,size>>>();

    for(int i=0; i< size; i++){
        double* rowt = (double*)((char *)d_t + i * d_gridPitch);
        printf("%.0f, ",rowt[0]);
    }

    cudaDeviceReset();

    return 0;
}
talonmies
The `cudaMemcpy2D` function is used to copy to or from a pitched allocation (i.e. created with `cudaMallocPitch`). [Here](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g3a58270f6775efe56c65ac47843e7cee) is the API documentation for `cudaMemcpy2D`. If you search on this CUDA tag you will find *many* questions and answers that demonstrate proper usage, such as [this one](http://stackoverflow.com/questions/35771430/cuda-cudamallocpitch-cudamemcpy2d-with-2d-array-error). Use [proper CUDA error checking](http://stackoverflow.com/questions/14038589). – Robert Crovella Mar 26 '16 at 00:30

1 Answer


As indicated in comments, the cudaMemcpy2D API is designed for exactly this task. You must allocate or statically define a host memory buffer or container to act as storage for the data from the device, and then provide the pitch of that host buffer to the cudaMemcpy2D call. The API handles the pitch conversion without any further intervention on the caller side.

If you replace the print loop with something like this:

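// Host buffer with tightly packed rows (host pitch = size * sizeof(double))
double* h_t = new double[size * size];
// Copy size rows of size doubles each; the API converts device pitch to host pitch
cudaMemcpy2D(h_t, size * sizeof(double), d_tLoc, d_pitchLoc,
        size * sizeof(double), size, cudaMemcpyDeviceToHost);
// Print the first column (j = 0) of each row
for(int i=0, j=0; i< size; i++){
    std::cout << h_t[i * size + j] << std::endl;
}
delete[] h_t;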

[Note that I'm using iostream here for the printing. CUDA uses a C++ compiler for compiling host code, and you should prefer iostream functions over cstdio because they are less error prone and support improved diagnostics on most platforms.]

You can see that the API call form is very similar to the cudaMemset2D call that I provided for you in your last question.

talonmies