3

I am using cudaMalloc and cudaMemcpy to allocate a matrix and copy into it arrays of vectors , like this:

float **pa;    
cudaMalloc((void***)&pa,  N*sizeof(float*)); //this seems to be ok
for(i=0; i<N; i++) {
    cudaMalloc((void**) &(pa[i]), N*sizeof(float)); //this gives seg fault
    cudaMemcpy (pa[i], A[i], N*sizeof(float), cudaMemcpyHostToDevice); // also i am not sure about this
}

What is wrong with my instructions? Thanks in advance

P.S. A[i] is a vector


Now i'm trying to copy a matrix from the Device to a matrix from the host:

Supposing I have **pc in the device, and **pgpu is in the host:

cudaMemcpy (pgpu, pc, N*sizeof(float*), cudaMemcpyDeviceToHost);
for (i=0; i<N; i++)
    cudaMemcpy(pgpu[i], pc[i], N*sizeof(float), cudaMemcpyDeviceToHost);

= is wrong....

Madrugada
  • 1,261
  • 8
  • 24
  • 44

2 Answers2

5

pa is in device memory, so &(pa[i]) does not do what you are expecting it will. This will work

float **pa;
float **pah = (float **)malloc(pah, N * sizeof(float *));    
cudaMalloc((void***)&pa,  N*sizeof(float*));
for(i=0; i<N; i++) {
    cudaMalloc((void**) &(pah[i]), N*sizeof(float));
    cudaMemcpy (pah[i], A[i], N*sizeof(float), cudaMemcpyHostToDevice);
}
cudaMemcpy (pa, pah, N*sizeof(float *), cudaMemcpyHostToDevice);

ie. build the array of pointers in host memory and then copy it to the device. I am not sure what you are hoping to read from A, but I suspect that the inner cudaMemcpy probably isn't doing what you want as written.

Be forewarned that from a performance point of view, arrays of pointers are not a good idea on the GPU.

talonmies
  • 70,661
  • 34
  • 192
  • 269
  • thank you for your answer. Why arrays of pointers are not good for GPU? – Madrugada May 04 '11 at 15:18
  • 1
    Because an array of pointer requires two memory transactions to retrieve a value from global memory. Global memory access has very high latency on the GPU, so two trips to global memory to get a value is far less preferable than one plus a few IOPs, which is what indexing into a linear 1D memory allocation costs. – talonmies May 04 '11 at 15:26
  • did you mean: cudaMemcpy (pah[i], A[i], N*sizeof(float), cudaMemcpyHostToDevice); in the first line after the for ? (A is supposed to be a matrix in my program thus A[i] a vector) – Madrugada May 04 '11 at 17:52
  • Sorry, that was a typo, it should be pah in the cudaMemcpy. – talonmies May 04 '11 at 18:19
  • How can I copy a matrix from the device to a matrix from the host? I have done this (pgpu is in the host, pc in the device): cudaMemcpy(pgpu, pc, N*sizeof(float*), cudaMemcpyDeviceToHost); however when i try to access pgpu it gives seg fault – Madrugada May 04 '11 at 18:39
  • Do the same procedure as the code above, but with the inner cudaMemcpy reversed. You need to use the host array of device pointers you filled during the host->device copy. – talonmies May 04 '11 at 18:47
  • I've done this: float **aux; cudaMalloc((void***)&aux, N*sizeof(float*)); for(i=0,N..) {aux[i] = (float*)malloc(N*sizeof(float)); cudaMemcpy(aux[i], pc[i], N*sizeof(float), cudaMemcpyDeviceToHost); } cudaMemcpy(pgpu, aux, N*sizeof(float*), cudaMemcpyDeviceToHost); //and it's not working – Madrugada May 04 '11 at 19:01
  • Of course it will segmentation fault. You are mixing device and host pointers again. There is little point continuing this in comments. Edit what you have done into the question where is it readable and I *might* look at it again. – talonmies May 04 '11 at 19:21
  • cudaMalloc((void**) &(pah[i]), N*sizeof(float)); should be cudaMalloc((void**) &(pa[i]), N*sizeof(float)); right?! – dreamcrash Aug 01 '13 at 08:10
  • @dreamcrash: no. pa is in device memory, its elements can't be directly accessed in the host API. You would get a segfault if you tried (literally the first mistake people make when trying to do this, if you leave aside trying to do it in the first place). The whole idea is to built a copy of what will be the device array of device pointers in host memory, then copy that to the device. – talonmies Aug 01 '13 at 08:22
  • @talonmies thanks, it makes sense, I should had read first the OP question/your answer, before I made my question. Thanks for the patience. – dreamcrash Aug 01 '13 at 08:36
2

What is your eventual goal of this code? As hinted at above, it would probably be in your best interest to flatten pa into a 1-dimensional array for usage on the GPU. Something like:

float *pa;
cudaMalloc((void**)&pa, N*N*sizeof(float));

Unfortunately you'd have to adjust A[i] to do your memory copy this way though.

Adam27X
  • 889
  • 1
  • 7
  • 16