Understanding the usage of cudaMalloc to allocate a matrix

Question

While trying to understand how cudaMalloc() works for a 2D matrix, I came across the following post:

I wanted to clarify some points of the answer given by talonmies, hence this separate post. Talonmies gave the following solution.

float **pa;
float **pah = (float **)malloc(N * sizeof(float *));
cudaMalloc((void***)&pa,  N*sizeof(float*));
for(i=0; i<N; i++) {
    cudaMalloc((void**) &(pah[i]), N*sizeof(float));
    cudaMemcpy (pah[i], A[i], N*sizeof(float), cudaMemcpyHostToDevice);
}
cudaMemcpy (pa, pah, N*sizeof(float *), cudaMemcpyHostToDevice);

The code in line 5 :

cudaMalloc((void**) &(pah[i]), N*sizeof(float));

allocates a block of N floats in device memory and puts the starting address of the i-th block in pah[i]. The array pah itself resides in host memory, but the content of each pah[i] is the address of memory allocated on the device.

Question 1> Is the above understanding correct ?

The code in line 6:

 cudaMemcpy (pah[i], A[i], N*sizeof(float), cudaMemcpyHostToDevice);

copies A[i] from host to the content of pah[i] (the content of pah[i] being the starting address of each of the N*float blocks).

Question 2> Is the above understanding of how the host memory gets copied to device memory correct ?

In order to access the (N,N) block of memory on the device (created by line 5 above) like a 2D array, we now need to copy the contents of all the pah[i]'s to a pointer array on the device. So first, N float pointers are allocated on the device by the code in line 3. Then the addresses of the N*float chunks are copied from pah to pa by the code in line 8. After this, device code can access the copy of A[i][j] (which resides on the host) as pa[i][j] (which resides on the device).

Question 3> Is the above understanding correct ?

Now say I spawn N*N threads and each thread overwrites pa[i][j] with its thread id. Then I want to copy the content of pa[i][j] on the device back to A[i][j] on the host. Will the code below do the job, or am I making a mistake?

for (i=0; i<N; i++)
  cudaMemcpy(A[i], pa[i], N*sizeof(float), cudaMemcpyDeviceToHost);

Thanks in advance to all who help me clarify these doubts/questions.

Best

score 2 · Accepted Answer · edited May 23 '17 at 10:31

Question 1> Is the above understanding correct ?

Yes.
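
A minimal sketch of this point, assuming CUDA 10.0 or later (where cudaPointerAttributes has a `type` field) and a working GPU. The helper names is_device_pointer and alloc_device_rows are hypothetical; they only illustrate that the table pah is ordinary host memory while each pah[i] holds a device address.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: true if the runtime reports p as device memory.
// Assumes CUDA 10.0+ (cudaPointerAttributes::type).
bool is_device_pointer(const void *p) {
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
        cudaGetLastError();  // older runtimes report an error for plain host pointers; clear it
        return false;
    }
    return attr.type == cudaMemoryTypeDevice;
}

// Mirrors line 5 of the question: the table is allocated with malloc (host),
// and each entry receives the address of a device block of n floats.
// Error checking is omitted here to keep the sketch short.
float **alloc_device_rows(int n) {
    assert(n > 0);
    float **pah = (float **)malloc(n * sizeof(float *));
    for (int i = 0; i < n; i++)
        cudaMalloc((void **)&pah[i], n * sizeof(float));
    return pah;
}
```

Under these assumptions, is_device_pointer(pah) should come back false and is_device_pointer(pah[i]) true for every row.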

Question 2> Is the above understanding of how the host memory gets copied to device memory correct ?

Perhaps. I would say: "copies N*sizeof(float) bytes from the host, starting at the (host) address contained in A[i], to the device, starting at the device address contained in pah[i]."

Question 3> Is the above understanding correct ?

Yes, I might word a few things differently but the changes I might make seem minor. I think you've got it.
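
To make the pa[i][j] access concrete, here is a rough sketch rather than a definitive implementation. It assumes one block per row and one thread per column, so N must fit in a single block (at most 1024 threads on current GPUs). The names write_thread_ids and fill_with_thread_ids are hypothetical.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread writes its flattened id into one element via the device pointer table.
__global__ void write_thread_ids(float **pa, int n) {
    int i = blockIdx.x;   // row: one block per row (assumption)
    int j = threadIdx.x;  // column: one thread per element (assumption)
    if (i < n && j < n)
        pa[i][j] = (float)(i * n + j);
}

// Builds the pointer table, runs the kernel, and copies each row back into A
// (A is a host array of n row pointers, each to n floats).
// Returns the first error encountered, or cudaSuccess.
cudaError_t fill_with_thread_ids(float **A, int n) {
    assert(n > 0 && n <= 1024);
    float **pa = NULL;                                   // device table of row pointers
    float **pah = (float **)calloc(n, sizeof(float *));  // host copy of the same table
    cudaError_t err = cudaMalloc((void **)&pa, n * sizeof(float *));
    for (int i = 0; i < n && err == cudaSuccess; i++)
        err = cudaMalloc((void **)&pah[i], n * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpy(pa, pah, n * sizeof(float *), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        write_thread_ids<<<n, n>>>(pa, n);
        err = cudaGetLastError();  // catches launch failures
    }
    // The host reads rows through pah[i]; cudaMemcpy waits for the kernel to finish.
    for (int i = 0; i < n && err == cudaSuccess; i++)
        err = cudaMemcpy(A[i], pah[i], n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        cudaFree(pah[i]);  // cudaFree on a null pointer is a no-op
    cudaFree(pa);
    free(pah);
    return err;
}
```

Inside the kernel, pa[i] is read from the device table and then indexed by j, which is exactly the two-level lookup the question describes.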

Will the code line below do the job, or am I making any mistake ?

It should be:

for (i=0; i<N; i++)
  cudaMemcpy(A[i], pah[i], N*sizeof(float), cudaMemcpyDeviceToHost);

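You are literally wanting to reverse the operation contained in line 6. Note that pa is a device pointer, so host code cannot dereference it as pa[i]; the device row addresses are available on the host in pah[i]. Don't forget to use proper CUDA error checking any time you are having trouble with CUDA code.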
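
One common way to do the error checking is a small wrapper macro around each runtime call. The sketch below is only an illustration: the macro name CUDA_CHECK and the function copy_rows_to_host are hypothetical, and aborting on the first failure is an assumption about what you want.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: prints the error string with file/line and exits on failure.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Copies n device rows back into the host matrix A.
// pah is the host-side table holding the device address of each row.
void copy_rows_to_host(float **A, float *const *pah, int n) {
    assert(A != NULL && pah != NULL);
    for (int i = 0; i < n; i++)
        CUDA_CHECK(cudaMemcpy(A[i], pah[i], n * sizeof(float),
                              cudaMemcpyDeviceToHost));
}
```

With a check like this in place, passing a bad pointer to cudaMemcpy should produce a message naming the failing line instead of silently leaving A unchanged.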

thank you very much for your inputs/comments. I understand the mechanism much better now. — user1612986, Mar 06 '14 at 13:15
