While trying to understand how cudaMalloc() works for a 2D matrix, I came across the following post:

Using cudaMalloc to allocate a matrix

I wanted to clarify some points in the answer given by talonmies, so I created this separate post. talonmies gave the following solution:

float **pa;
float **pah = (float **)malloc(N * sizeof(float *));
cudaMalloc((void**)&pa, N*sizeof(float*));
for(i=0; i<N; i++) {
    cudaMalloc((void**) &(pah[i]), N*sizeof(float));
    cudaMemcpy (pah[i], A[i], N*sizeof(float), cudaMemcpyHostToDevice);
}
cudaMemcpy (pa, pah, N*sizeof(float *), cudaMemcpyHostToDevice);

The code in line 5:

cudaMalloc((void**) &(pah[i]), N*sizeof(float));

allocates a block of N floats in device memory and puts the starting address of that block in pah[i]. pah[i] resides in host memory, but the content of each pah[i] is the address of memory allocated on the device.

Question 1> Is the above understanding correct ?

The code in line 6:

 cudaMemcpy (pah[i], A[i], N*sizeof(float), cudaMemcpyHostToDevice);

copies A[i] from host to the content of pah[i] (the content of pah[i] being the starting address of each of the N*float blocks).

Question 2> Is the above understanding of how the host memory gets copied to device memory correct ?

In order to access the (N,N) block of memory on the device (created by line 5 above) like a 2-d array, we now need to copy the contents of all the pah[i]'s to a pointer on the device. So first N float pointers are allocated on the device by the code in line 3. Then the addresses of the N*float chunks are copied from pah to pa by the code in line 8. After this, the contents of A[i][j] residing on the host can be accessed as pa[i][j] on the device.

Question 3> Is the above understanding correct ?

Now say I spawn N*N threads and each thread overwrites pa[i][j] with its thread id. Then I want to copy the content of pa[i][j] residing on the device back to A[i][j] residing on the host. Will the code below do the job, or am I making a mistake?

for (i=0; i<N; i++)
  cudaMemcpy(A[i], pa[i], N*sizeof(float), cudaMemcpyDeviceToHost);  

Thanks in advance to all who help me clarify these doubts/questions.

Best

user1612986

1 Answer

Question 1> Is the above understanding correct ?

Yes.

Question 2> Is the above understanding of how the host memory gets copied to device memory correct ?

Perhaps. I would say: "copies N*sizeof(float) bytes, starting at the (host) address contained in A[i] from host to the device, starting at the device address contained in pah[i]."
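To make the host/device split concrete, here is a minimal sketch of lines 5 and 6 wrapped in a function. The name upload_rows and the return-NULL-on-failure convention are my own assumptions for illustration, not part of the original code. The array pah lives in host memory, while every value stored in it is a device address.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical helper: allocates n device rows of n floats each and copies
// the host rows A[i] into them. Returns a HOST array whose entries are
// DEVICE addresses (the "pah" of the question), or NULL on failure.
float **upload_rows(float **A, int n)
{
    float **pah = (float **)malloc(n * sizeof(float *));
    if (pah == NULL) return NULL;
    for (int i = 0; i < n; i++) pah[i] = NULL;

    for (int i = 0; i < n; i++) {
        // cudaMalloc writes a device address into the host variable pah[i].
        if (cudaMalloc((void **)&pah[i], n * sizeof(float)) != cudaSuccess ||
            // Copy n*sizeof(float) bytes from host address A[i]
            // to the device address stored in pah[i].
            cudaMemcpy(pah[i], A[i], n * sizeof(float),
                       cudaMemcpyHostToDevice) != cudaSuccess) {
            for (int k = 0; k <= i; k++) cudaFree(pah[k]);  // cudaFree(NULL) is a no-op
            free(pah);
            return NULL;
        }
    }
    return pah;
}
```

Reading pah[i] on the host is fine (it is just a number), but dereferencing it on the host, as in pah[i][0], is not, because the address it holds refers to device memory.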

Question 3> Is the above understanding correct ?

Yes, I might word a few things differently but the changes I might make seem minor. I think you've got it.
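As a rough sketch of what lines 3 and 8 buy you, the code below builds the device-side pointer table and then indexes it as pa[i][j] inside a kernel. The names make_device_table, fill_with_thread_id and launch_fill, the 16x16 block size, and the choice of i*n + j as the "thread id" are all assumptions made for this example.

```cuda
#include <cuda_runtime.h>

// Each thread writes a flattened index (assumed here to stand in for the
// "thread id") into its own element. pa is a device array of device pointers.
__global__ void fill_with_thread_id(float **pa, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    if (i < n && j < n) {
        pa[i][j] = (float)(i * n + j);
    }
}

// Copies the host-side table of device row addresses (pah) into device
// memory. Returns the device-side table (pa), or NULL on failure.
float **make_device_table(float **pah, int n)
{
    float **pa = NULL;
    if (cudaMalloc((void **)&pa, n * sizeof(float *)) != cudaSuccess) return NULL;
    if (cudaMemcpy(pa, pah, n * sizeof(float *),
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(pa);
        return NULL;
    }
    return pa;
}

// Launches one thread per matrix element and waits for completion.
cudaError_t launch_fill(float **pa, int n)
{
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    fill_with_thread_id<<<grid, block>>>(pa, n);
    cudaError_t err = cudaGetLastError();   // catches launch errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // catches execution errors
}
```

Only device code may evaluate pa[i]; the host keeps pah precisely so that it still has a usable copy of the row addresses.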

Will the code line below do the job, or am I making any mistake ?

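It should be the following (note pah[i] rather than pa[i]: pa is a device pointer, so host code cannot dereference it to read pa[i]):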

for (i=0; i<N; i++)
  cudaMemcpy(A[i], pah[i], N*sizeof(float), cudaMemcpyDeviceToHost); 

You literally want to reverse the operation in line 6. Don't forget to use proper CUDA error checking any time you are having trouble with CUDA code.
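One possible shape for that error checking, combined with the copy-back loop, is sketched below. The macro name CUDA_CHECK and the function name download_rows are hypothetical; the macro simply tests the cudaError_t returned by a runtime call and aborts with a message if it is not cudaSuccess.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical error-checking macro: wraps any runtime call that returns
// cudaError_t and exits with file/line information on failure.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Copies each device row back into the matching host row.
// pah is the HOST array of device row addresses, not the device table pa.
void download_rows(float **A, float **pah, int n)
{
    for (int i = 0; i < n; i++) {
        CUDA_CHECK(cudaMemcpy(A[i], pah[i], n * sizeof(float),
                              cudaMemcpyDeviceToHost));
    }
}
```

After a kernel launch, checking cudaGetLastError() and the result of cudaDeviceSynchronize() in the same way covers launch and execution failures before the copy-back runs.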

Robert Crovella