
I am looking into how to copy a 2D array of variable width for each row into the GPU.

int rows = 1000;
int cols;
int **host_matrix = malloc(rows * sizeof(int *));
int **d_array;
int *length;

...

Each host_matrix[i] might have a different length, which I know as length[i], and that is where the problem starts. I would like to avoid copying dummy padding data. Is there a better way of doing it?

According to this thread, this wouldn't be a clever way of doing it:

cudaMalloc((void **)&d_array, rows*sizeof(int*));
for(int i = 0 ; i < rows ; i++)    {  
    cudaMalloc((void **)&d_array[i], length[i] * sizeof(int)); 
}  

But I cannot think of any other method. Is there any other smarter way of doing it? Can it be improved using cudaMallocPitch and cudaMemCpy2D ??

Manolete
    You *really* don't want to do this. 1000 rows = 1001 cudaMalloc calls to setup, then 1001 cudaMemcpy calls every time you want to move data between the host and device. The API overhead costs will kill the performance of even the most embarrassingly parallel problem. – talonmies Sep 18 '12 at 17:21

2 Answers


The correct way to allocate an array of pointers for the GPU in CUDA is something like this:

int **hd_array, **d_array;
hd_array = (int **)malloc(nrows*sizeof(int*));
cudaMalloc((void **)&d_array, nrows*sizeof(int*));
for(int i = 0 ; i < nrows ; i++)    {  
    cudaMalloc((void **)&hd_array[i], length[i] * sizeof(int)); 
}
cudaMemcpy(d_array, hd_array, nrows*sizeof(int*), cudaMemcpyHostToDevice);

(disclaimer: written in browser, never compiled, never tested, use at own risk)

The idea is that you assemble a copy of the array of device pointers in host memory first, then copy that to the device. For your hypothetical case with 1000 rows, that means 1001 calls to cudaMalloc and then 1001 calls to cudaMemcpy just to set up the device memory allocations and copy data into the device. That is an enormous overhead penalty, and I would counsel against trying it; the performance will be truly terrible.
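To make the cost concrete, here is a hedged sketch (in the same untested, browser-written spirit as the snippet above) of the per-row copies you would still need after the allocation loop, using `host_matrix` and `length` from the question:

```cuda
// Sketch only: one cudaMemcpy per row, on top of the nrows+1 allocations.
// host_matrix and length are the host-side arrays from the question.
for (int i = 0; i < nrows; i++) {
    cudaMemcpy(hd_array[i], host_matrix[i],
               length[i] * sizeof(int), cudaMemcpyHostToDevice);
}
```

Each of those calls carries fixed API latency regardless of how small the row is, which is where the overhead penalty comes from.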

If you have very jagged data and need to store it on the device, might I suggest taking a cue from the mother of all jagged data problems - large, unstructured sparse matrices - and adopting one of the sparse matrix formats for your data instead. Using the classic compressed sparse row (CSR) format as a model, you could do something like this:

int *data, *rows, *lengths;

cudaMalloc((void **)&rows, nrows*sizeof(int));
cudaMalloc((void **)&lengths, nrows*sizeof(int));
cudaMalloc((void **)&data, N*sizeof(int));

In this scheme, all the data is stored in a single, linear memory allocation, data. The ith row of the jagged array starts at data[rows[i]] and has lengths[i] elements. This means you only need three memory allocation and copy operations to transfer any amount of data to the device, rather than nrows in your current scheme, i.e. it reduces the overhead from O(N) to O(1).

talonmies
  • @talonmies Let me try that one. It seems to be a reasonable performance solution. By the way, with your first solution it seems that you do 1000 cudaMalloc and only 1 cudaMemcpy. Am I missing something? – Manolete Sep 19 '12 at 07:36
  • @Manolete: yes you are missing something. That loop only *allocates* the memory on the device. You would still require another 1000 `cudaMemcpy` calls to transfer your data to the device, one per row. – talonmies Sep 19 '12 at 08:57
  • @talonmies: Right, it doesn't seem to be a good option considering my numbers. It would be better to have a fixed-size double array and use `cudaMallocPitch` in that case. About your second option, could you provide me a bit more information (sources, links) to understand how I can do it? – Manolete Sep 19 '12 at 09:12
  • @Manolete: I am not sure what you mean by a "fixed-size double array", but if you mean what I think you do, it isn't what `cudaMallocPitch` does. `cudaMallocPitch` just gives a one-dimensional, linear memory allocation just like `cudaMalloc`. The only difference is that the API internally calculates the size (probably including padding) and gives you a linear pitch to use during one-dimensional indexing that is optimal for some memory operations on the device (like binding to textures). – talonmies Sep 19 '12 at 09:28
  • @talonmies: You are right, that is not what I need to do. Thanks for the explanation. – Manolete Sep 19 '12 at 11:41

I would put all the data into one array. Then compose another array with the row lengths, so that A[0] is the length of row 0 and so on, i.e. A[i] = length[i]. Then you just need to allocate two arrays on the card and call memcpy twice.
Of course it's a little bit of extra work, but I think performance-wise it will be an improvement (depending, of course, on how you use the data on the card)
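As a hedged sketch of how a kernel might consume such a flattened layout (assuming one thread per row, and that a row-start offset array d_rows has also been copied to the device alongside the lengths, as in the CSR-style scheme above; all names are illustrative):

```cuda
// Hypothetical kernel: sums each row of the flattened jagged array.
// d_data    - flat array holding all rows back to back
// d_rows    - starting offset of each row inside d_data
// d_lengths - number of elements in each row
__global__ void sum_rows(const int *d_data, const int *d_rows,
                         const int *d_lengths, int *d_out, int nrows)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nrows) {
        int acc = 0;
        for (int j = 0; j < d_lengths[i]; j++)
            acc += d_data[d_rows[i] + j];
        d_out[i] = acc;
    }
}
```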

Albert S