While trying to understand how cudaMalloc() works for a 2D matrix, I came across the following post:
Using cudaMalloc to allocate a matrix
I wanted to clarify some points of the answer given by talonmies, hence this separate post. talonmies gave the following solution:
    float **pa;                                                    // device array of row pointers
    float **pah = (float **)malloc(N * sizeof(float *));           // host copy of the row pointers
    cudaMalloc((void **)&pa, N * sizeof(float *));
    for (i = 0; i < N; i++) {
        cudaMalloc((void **)&(pah[i]), N * sizeof(float));         // one device row per iteration
        cudaMemcpy(pah[i], A[i], N * sizeof(float), cudaMemcpyHostToDevice);
    }
    cudaMemcpy(pa, pah, N * sizeof(float *), cudaMemcpyHostToDevice);
The code in line 5:
    cudaMalloc((void **)&(pah[i]), N * sizeof(float));
allocates a block of N floats in device memory and stores the starting address of the i-th block in pah[i]. The array pah itself resides in host memory, but the value of each pah[i] is the address of memory allocated on the device.
Question 1> Is the above understanding correct?
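To make the host/device split concrete, here is a minimal sketch that asks the CUDA runtime how it classifies each pah[i]. It assumes CUDA 10 or newer (for the `type` field of `cudaPointerAttributes`); the helper name `rowsLiveOnDevice` is made up for illustration and is not part of the original answer.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: allocates n device rows of n floats each and returns
// true only if the runtime reports every pah[i] as a device address.
bool rowsLiveOnDevice(int n)
{
    // pah is an ordinary host array; calloc zeroes it so cleanup is safe.
    float **pah = (float **)calloc(n, sizeof(float *));
    if (pah == NULL) return false;

    bool allDevice = true;
    for (int i = 0; i < n && allDevice; i++) {
        if (cudaMalloc((void **)&pah[i], n * sizeof(float)) != cudaSuccess) {
            pah[i] = NULL;
            allDevice = false;
            break;
        }
        // Ask the runtime what kind of memory the stored address refers to.
        cudaPointerAttributes attr;
        if (cudaPointerGetAttributes(&attr, pah[i]) != cudaSuccess ||
            attr.type != cudaMemoryTypeDevice)
            allDevice = false;
    }

    for (int i = 0; i < n; i++) cudaFree(pah[i]);  // cudaFree(NULL) is a no-op
    free(pah);
    return allDevice;
}
```

On a machine with a working CUDA device, I would expect `rowsLiveOnDevice(4)` to return true, which matches the reading that pah lives on the host while the addresses it stores point into device memory.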
The code in line 6:
    cudaMemcpy(pah[i], A[i], N * sizeof(float), cudaMemcpyHostToDevice);
copies the N floats of row A[i] from host memory to the device block whose starting address is stored in pah[i].
Question 2> Is the above understanding of how the host memory gets copied to device memory correct?
In order to access the (N,N) block of memory in the device (created by line 5 above) like a 2D array, the addresses stored in the pah[i]'s now have to be copied to a pointer array that lives in the device. So first, space for N float pointers is allocated in the device by the code in line 3. Then the addresses of the N*float chunks are copied from pah to pa by the code in line 8. After this, a kernel should be able to read the device copy of the host element A[i][j] as pa[i][j].
Question 3> Is the above understanding correct?
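The steps above can be sketched as a small test in which a kernel reads every element through pa[i][j] and writes it into a flat buffer that is then copied back in one call. The kernel name `flatten`, the helper `readThroughPa`, the flat output buffer, and the 16x16 block shape are assumptions made for illustration; error checking is reduced to the final status to keep the sketch short.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread reads one element through the device-side pointer table.
__global__ void flatten(float **pa, float *out, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    if (i < n && j < n)
        out[i * n + j] = pa[i][j];
}

// Hypothetical helper: uploads the n x n host matrix A row by row, runs the
// kernel, and stores what the device saw in flat (n * n floats, allocated
// by the caller). Returns 0 on success, 1 on any failure.
int readThroughPa(float **A, float *flat, int n)
{
    float **pa = NULL;                                   // device table of row pointers
    float *out = NULL;                                   // flat device output buffer
    float **pah = (float **)calloc(n, sizeof(float *));  // host copy of the table
    if (pah == NULL) return 1;

    cudaMalloc((void **)&pa, n * sizeof(float *));
    cudaMalloc((void **)&out, (size_t)n * n * sizeof(float));
    for (int i = 0; i < n; i++) {
        cudaMalloc((void **)&pah[i], n * sizeof(float));
        cudaMemcpy(pah[i], A[i], n * sizeof(float), cudaMemcpyHostToDevice);
    }
    // Publish the row addresses to the device so kernels can index pa[i][j].
    cudaMemcpy(pa, pah, n * sizeof(float *), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    flatten<<<grid, block>>>(pa, out, n);

    // The blocking copy waits for the kernel on the default stream.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(flat, out, (size_t)n * n * sizeof(float),
                         cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++) cudaFree(pah[i]);
    cudaFree(out);
    cudaFree(pa);
    free(pah);
    return err == cudaSuccess ? 0 : 1;
}
```

If the understanding above holds, flat[i * n + j] should equal A[i][j] for every element after the call.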
Now say I spawn N*N threads and each thread overwrites pa[i][j] with its thread ID. Then I want to copy the contents of pa[i][j] residing in the device back to A[i][j] residing in the host. Will the code below do the job, or am I making a mistake?
    for (i = 0; i < N; i++)
        cudaMemcpy(A[i], pa[i], N * sizeof(float), cudaMemcpyDeviceToHost);
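One detail to keep in mind for the loop above: pa holds a device address, so evaluating pa[i] in host code would dereference device memory from the host. A minimal sketch of a copy-back that uses the host-side table pah[i] as the source instead is shown here. The kernel name `writeThreadId`, the helper `fillAndCopyBack`, the choice of i * n + j as the "thread ID", and the 16x16 block shape are assumptions made for illustration; error checking is reduced to the final status.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread stores its flattened (row, column) index into pa[i][j].
__global__ void writeThreadId(float **pa, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    if (i < n && j < n)
        pa[i][j] = (float)(i * n + j);
}

// Hypothetical helper: fills an n x n device matrix and copies it into A.
// Returns 0 on success, 1 on any failure.
int fillAndCopyBack(float **A, int n)
{
    float **pa = NULL;                                   // device table of row pointers
    float **pah = (float **)calloc(n, sizeof(float *));  // host copy of that table
    if (pah == NULL) return 1;

    cudaMalloc((void **)&pa, n * sizeof(float *));
    for (int i = 0; i < n; i++)
        cudaMalloc((void **)&pah[i], n * sizeof(float));
    cudaMemcpy(pa, pah, n * sizeof(float *), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    writeThreadId<<<grid, block>>>(pa, n);

    // Copy back row by row. The source is pah[i], a device address that is
    // stored in host memory and can therefore be read by host code.
    cudaError_t err = cudaGetLastError();
    for (int i = 0; i < n && err == cudaSuccess; i++)
        err = cudaMemcpy(A[i], pah[i], n * sizeof(float),
                         cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++) cudaFree(pah[i]);
    cudaFree(pa);
    free(pah);
    return err == cudaSuccess ? 0 : 1;
}
```

Because pah already holds the same addresses that were copied into pa, no extra transfer of the pointer table is needed before the row copies.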
Thanks in advance to everyone who helps me clarify these doubts.
Best