
I want to dynamically allocate a global 2D array in CUDA. How can I achieve this?

In my main I call my kernel in a loop. Before each kernel call I need to allocate some memory on the GPU. After the kernel call, a single integer is sent from the GPU to the CPU to report whether the problem is solved or not.
If the problem is not solved, I will not free the old memory, since it is still needed, and I should allocate new memory on the GPU and call the kernel again.

Pseudocode:

int n = 0, i = 0;
while (n == 0)
{
    // allocate 2D memory for MEM[i++]
    // call kernel(MEM, i)
    // get n from the kernel
}


__global__ void kernel(int **Mem, int i)
{
    Mem[0][5] = 1;
    Mem[1][0] = Mem[0][5] + 23; // can use this when MEM[1] is allocated before the kernel call
}

Any suggestions? Thank you.

asked by liza (edited by Deepu)

3 Answers


Two opening comments: using a dynamically allocated 2D array is a bad idea in CUDA, and doing repeated memory allocations inside a loop is also not a good idea. Both incur needless performance penalties.

For the host code, something like this:

const int n_allocations = 16;     // upper bound on loop iterations
const size_t allocsize = 16000;   // elements needed per kernel call
float *dpointer;
cudaMalloc((void **)&dpointer, n_allocations * allocsize * sizeof(float));

float *dcurrent = dpointer;
int n = 0;
for (int i = 0; (n == 0) && (i < n_allocations); i++, dcurrent += allocsize) {

    // whatever you do before the kernel

    kernel <<< gridsize,blocksize >>> (dcurrent,.....);

    // whatever you do after the kernel

}

is preferable. Here you only call cudaMalloc once and pass offsets into the allocation, which makes memory allocation and management inside the loop effectively free. The loop bound also means you can't run endlessly and exhaust all the GPU memory.

On the 2D array question itself, there are two reasons why it is a bad idea. Firstly, allocating a 2D array with N rows requires (N+1) cudaMalloc calls and a host-to-device memory copy, which is slow and ugly. Secondly, inside the kernel code, to get at your data the GPU must do two global memory reads: one for the pointer indirection to get the row address, and then one to fetch the data from the row. That is much slower than this alternative:

#define idx(i,j,lda) ( (j) + ((i)*(lda)) )
__global__ void kernel(float *Mem, int lda, ....)
{
    Mem[idx(0,5,lda)] = 1; // the equivalent of Mem[0][5] = 1;
}

which uses indexing into a 1D allocation. On the GPU, memory transactions are very expensive, but FLOPs and IOPs are cheap, so a single integer multiply-add is the most efficient way to compute the index. If you need to access results from a previous kernel call, just pass the offset to the previous results and use two pointers inside the kernel, something like this:

__global__ void kernel(float *Mem, int lda, int current, int previous)
{
   float *Mem0 = Mem + current;   // this call's results
   float *Mem1 = Mem + previous;  // the previous call's results

}

Efficient distributed memory programs (and CUDA is really a type of distributed memory programming) start to look like Fortran after a while, but that is the price you pay for portability, transparency and efficiency.

Hope this helped.

talonmies
  • your answer is well done and really better than mine, but I had to delete the examples I posted using 1D arrays (saying that you can use a 1D array in place of a 2D one) because they voted me down twice for that. Look at the comments (I totally deleted the previous answer). – Marco Apr 12 '11 at 07:46

Well, you can do it just as it would be done on the CPU.

unsigned xSize = 666, ySize = 666;
int **h_ptr = (int**)malloc(xSize * sizeof(int*));
int **d_ptr = NULL;
cudaMalloc( &d_ptr, xSize * sizeof(int*) );
for(unsigned i = 0; i < xSize; ++i)
{
    cudaMalloc( &h_ptr[i], ySize * sizeof(int) ); // device row; pointer stored on the host
}
cudaMemcpy( d_ptr, h_ptr, xSize * sizeof(int*), cudaMemcpyHostToDevice );
free( h_ptr );

...and free similarly:

int **h_ptr = (int**)malloc(xSize * sizeof(int*));
cudaMemcpy( h_ptr, d_ptr, xSize * sizeof(int*), cudaMemcpyDeviceToHost );
for(unsigned i = 0; i < xSize; ++i )
{
    cudaFree( h_ptr[i] );
}
cudaFree( d_ptr );
free( h_ptr );

But you should keep in mind that every access to a cell of this array will involve accessing GPU global memory twice: once to read the row pointer, and once to read the element itself. Because of that, memory access will be about twice as slow as with a 1D array.

x13n
    That first snippet won't work. You cannot directly cudaMalloc the elements of d_ptr from the host, because d_ptr holds a device address. You will get an illegal memory access error on the host if you do so. The only way to do it is to allocate each row into a **host** array of pointers, then copy that complete array onto a device array of pointers. – talonmies Apr 12 '11 at 07:39
    it is a trap that everyone falls into at least once. – talonmies Apr 12 '11 at 08:11

EDITED:
I was trying to help you by providing an example in which, by flattening the array, you can achieve the same result, but others told me it's not what you're asking for.
So there is another post here telling you how you can allocate 2D arrays in CUDA.

Marco
  • In CUDA allocating a 2D array is like allocating a 1D array, because in many kernels you can view it in memory as a plain 1D array... Why vote me down?!? I'm trying to help her.... – Marco Apr 12 '11 at 06:40
  • in CUDA it is possible to allocate a 2D array using cudaMallocPitch and cudaMemcpy2D. But I can't answer the question because I am not sure if this can be done in a loop... – scatman Apr 12 '11 at 06:41
  • @scatman: you're right, I know, but I'm just trying to help her telling that you can allocate a 2D array like allocating 1D one using different dimensions. And you can use it the way you please inside kernels. – Marco Apr 12 '11 at 06:52
  • @scatman, @Ghyath: I edited my post removing examples, even if I think she could get some use of it. Hope this helps her a little more. – Marco Apr 12 '11 at 07:12
  • @scatman: cudaMallocPitch does not allocate 2D arrays. It only allocates a linear memory allocation which has padding calculated to play nicely with the GPU texture hardware. There is no API in CUDA that will "automagically" allocate a dynamic C array of pointers with each array entry also allocated to a requested size. There is also no API call that will "deep" copy such memory between the host and device. – talonmies Apr 12 '11 at 07:46
  • @marco keep up the good work - it's increasingly apparent that the OP needs a lot of help and needed to know how to turn the 2D array into a flat one! it sounds like you just needed to explain more clearly why you considered 1D the correct solution! – jmilloy Apr 12 '11 at 15:16
  • @jmilloy: sorry, my English is not good enough to understand if you're really telling me something good or (as I think) you're fooling me. I started by presenting working code with a 1D array, hoping it could be useful for liza. Other users gave me bad comments and votes for that because my post was not what liza needed. So I tried to post a link, no more. talonmies in his post says it's a bad idea and he's far more competent than me: I'm a newbie in CUDA, having developed only some small apps with it. I was just trying to help someone like others helped me. If you tell me to, I'll delete my post :) – Marco Apr 12 '11 at 17:31
  • @marco i'm telling you that you did something good, and that i'm sorry you received bad comments – jmilloy Apr 12 '11 at 18:19
  • @jmilloy: OMG, thanks, I thought the contrary!! Luckily talonmies explained much much much better than me, his English is perfect and he's skilled in CUDA hundred times more than me. I'm just sad for those who gave me bad comments without posting a good solution. I think to SO as a collaborative site in which giving and taking help is the scope. – Marco Apr 12 '11 at 18:34