Basic CUDA pointer/array memory allocation and use

Question

I started learning CUDA last week because I have to convert an existing C++ program to CUDA for my research.

This is a basic example from the CUDA by Example book, which I recommend to anyone who wants to learn CUDA!

Can someone explain how you can assign GPU memory with 'dev_c' which is an empty pointer?

HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

Then, not pass any 'dev_c' values when calling the function 'add' but treat *c as an array in the global function and write to it from within the function? Why is this possible when it's not defined as an array anywhere?

add<<<N,1>>>( dev_a, dev_b, dev_c );

Finally, where exactly do the terms c[0], c[1] etc. get saved when performing the following addition?

c[tid] = a[tid] + b[tid];

I hope I am explaining myself well but feel free to ask any follow-up questions. New to C as well as CUDA so be nice :D

Entire code below:

#include "book.h"

#define N   1000

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // this thread handles the data at its thread id
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int),
                                cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int),
                                cudaMemcpyHostToDevice ) );

    add<<<N,1>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
                                cudaMemcpyDeviceToHost ) );

    // display the results
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    // free the memory allocated on the GPU
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_c ) );

    return 0;
}

Thank you!

score 2 · Accepted Answer · edited May 23 '17 at 12:05

It's not going to be possible to teach CUDA in the space of an SO question. I will try to answer your questions, but you should probably avail yourself of some resources. It will be especially difficult if you don't know C or C++, because typical CUDA programming depends on those.

You might want to take some introductory webinars, such as:

GPU Computing using CUDA C – An Introduction (2010): An introduction to the basics of GPU computing using CUDA C. Concepts will be illustrated with walkthroughs of code samples. No prior GPU computing experience required.

GPU Computing using CUDA C – Advanced 1 (2010): First-level optimization techniques such as global memory optimization and processor utilization. Concepts will be illustrated using real code examples.

Now to your questions:

Can someone explain how you can assign GPU memory with 'dev_c' which is an empty pointer?

dev_c starts out as an uninitialized pointer: it does not yet point to any valid memory. The cudaMalloc function allocates GPU memory of the size passed to it and stores the address of that allocation in dev_c. It can do this because we pass the address of dev_c (&dev_c), not the current value of the pointer.
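The out-parameter pattern can be tried on the host without a GPU. The sketch below is only an analogue: toy_alloc_ints and demo_out_parameter are hypothetical names, and it uses std::malloc rather than device memory. cudaMalloc takes a void** so that it works for any element type, which is why the original code casts &dev_c; the sketch takes an int** to keep the example simple.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for cudaMalloc: allocates room for `count` ints on the
// host and reports the address through the out-parameter `out`.
// Returns 0 on success, 1 on failure.
int toy_alloc_ints(int** out, std::size_t count) {
    if (out == nullptr) return 1;
    int* block = static_cast<int*>(std::malloc(count * sizeof(int)));
    if (block == nullptr) return 1;
    *out = block;  // overwrite the caller's pointer variable
    return 0;
}

// Allocates n ints, writes to the last one, and reads it back.
int demo_out_parameter(std::size_t n) {
    if (n == 0) return -1;
    int* p = nullptr;                          // points nowhere yet
    if (toy_alloc_ints(&p, n) != 0) return -1; // pass the ADDRESS of p
    p[n - 1] = 42;                             // p now refers to usable storage
    int value = p[n - 1];
    std::free(p);
    return value;
}
```

If toy_alloc_ints took a plain int* instead, it would only receive a copy of p, and the caller's p would still be null after the call.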

Then, not pass any 'dev_c' values when calling the function 'add' but treat *c as an array in the global function and write to it from within the function? Why is this possible when it's not defined as an array anywhere?

In C, a pointer (which is what dev_c is) can point to a single value or to the first element of an array of values. The pointer itself does not contain information about how much data it points to. Since dev_c has already been set by the preceding cudaMalloc call, we can pass it to the kernel and use it to store the results. dev_c actually points to storage for an array of int whose size, N * sizeof(int) bytes, was passed to that cudaMalloc call.
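As a small host-side illustration of that point (the function names are made up for this sketch), the same int* parameter can be handed either the address of one int or the start of an array, and the length always has to be supplied separately:

```cpp
#include <cstddef>

// Sums `count` ints starting at `p`. The pointer carries no length, so the
// caller has to supply it separately.
int sum_through_pointer(const int* p, std::size_t count) {
    int total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += p[i];  // p[i] means *(p + i)
    }
    return total;
}

// The same function accepts a pointer to one int or to the start of an array.
int demo_single_and_array() {
    int single = 5;
    int many[4] = {1, 2, 3, 4};
    return sum_through_pointer(&single, 1)  // one int treated as a 1-element array
         + sum_through_pointer(many, 4);    // array name decays to int*
}
```

This is also why the kernel in the question guards with if (tid < N): nothing in the pointer itself would stop an out-of-range index.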

Finally, where exactly do the terms c[0], c[1] etc. get saved when performing the following addition?

In C, when we have a function definition like so:

void my_function(int *c){...}

This says that statements within the function can reference a variable named c as if it were a pointer to one or more int values (either a single value or an array of values, stored beginning at the location pointed to by c).

When we call that function, we can pass a variable with a different name as the argument for the parameter c, like so:

int my_ints[32];
my_function(my_ints);

Now, inside my_function, wherever the parameter c is referenced, it will use the argument value given by the (pointer) my_ints.

The same concepts hold for CUDA functions (kernels) and their arguments and parameters.
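To see where the c[tid] values end up, here is a host-only analogue of the add kernel. It is a sketch, not the book's code: add_on_host and demo_results_land_in_argument are hypothetical names, and a plain loop stands in for the N blocks that the real launch creates.

```cpp
#include <cstddef>

// Host-only analogue of the kernel: one loop iteration per "thread".
void add_on_host(const int* a, const int* b, int* c, std::size_t n) {
    for (std::size_t tid = 0; tid < n; ++tid) {
        c[tid] = a[tid] + b[tid];  // lands in whatever storage the caller passed as c
    }
}

// Fills small inputs the same way the question does and returns result[3].
int demo_results_land_in_argument() {
    const std::size_t n = 4;
    int xs[n], ys[n], result[n];
    for (std::size_t i = 0; i < n; ++i) {
        xs[i] = -static_cast<int>(i);
        ys[i] = static_cast<int>(i * i);
    }
    add_on_host(xs, ys, result, n);  // parameter c is bound to result
    return result[3];                // -3 + 9
}
```

In the CUDA version the argument bound to c is dev_c, so the writes go to the device memory that cudaMalloc allocated, and cudaMemcpy then copies them back into the host array c.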

Thank you, that makes a bit more sense now! So does that mean the values in (array) c gets saved in the GPU global memory that was allocated earlier in cudaMalloc( (void**)&dev_c, N * sizeof(int) )? — user2550888, Jul 04 '13 at 16:17
Yes. The kernel usage of `c` gets stored in the *argument* passed for the `c` parameter, which in this case is `dev_c`. And `dev_c` has been previously set up with the allocated size in device global memory. This is essentially C behavior, and has almost nothing to do with CUDA. — Robert Crovella, Jul 04 '13 at 16:21
Got it! Thanks again for the detailed responses, much appreciated! — user2550888, Jul 04 '13 at 16:27
