54

I am currently going through the tutorial examples on http://code.google.com/p/stanford-cs193g-sp2010/ to learn CUDA. The code which demonstrates __global__ functions is given below. It simply creates two arrays, one on the CPU and one on the GPU, populates the GPU array with the number 7, and copies the GPU array data into the CPU array.

#include <stdlib.h>
#include <stdio.h>

__global__ void kernel(int *array)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;

  array[index] = 7;
}

int main(void)
{
  int num_elements = 256;

  int num_bytes = num_elements * sizeof(int);

  // pointers to host & device arrays
  int *device_array = 0;
  int *host_array = 0;

  // malloc a host array
  host_array = (int*)malloc(num_bytes);

  // cudaMalloc a device array
  cudaMalloc((void**)&device_array, num_bytes);

  int block_size = 128;
  int grid_size = num_elements / block_size;

  kernel<<<grid_size,block_size>>>(device_array);

  // download and inspect the result on the host:
  cudaMemcpy(host_array, device_array, num_bytes, cudaMemcpyDeviceToHost);

  // print out the result element by element
  for(int i=0; i < num_elements; ++i)
  {
    printf("%d ", host_array[i]);
  }

  // deallocate memory
  free(host_array);
  cudaFree(device_array);
} 

My question is: why have they worded the cudaMalloc((void**)&device_array, num_bytes); statement with a double pointer? Even the definition of cudaMalloc() says the first argument is a double pointer.

Why not simply return a pointer to the beginning of the allocated memory on the GPU, just like the malloc function does on the CPU?

smilingbuddha
  • 14,334
  • 33
  • 112
  • 189
  • 9
    Because it returns an error code that tells you why it failed. Returning a null pointer on failure, like malloc(), is a poor substitute for an error code; it doesn't mean anything more than "it didn't work". You are supposed to check it. – Hans Passant Nov 03 '11 at 00:57
  • 8
    @Hans: It's still a horrible API design. Instead, it should take an extra `int *error` argument to store the error code, which will be valid when the return value is a null pointer. As-is, the design negates all benefits of `void` pointers and requires you to jump through hoops to use the function correctly. – R.. GitHub STOP HELPING ICE Nov 03 '11 at 02:59
  • 2
    @R.: you get credit for offering an alternative at the same time you criticize the API - most API critics whinge without proposing alternatives - but unless you believe that every CUDA runtime call should take an additional int * to pass back an error code, (which would make for a much more cluttered and difficult-to-use API), your alternative proposal is not orthogonal and violates the principle of least astonishment. – ArchaeaSoftware Nov 03 '11 at 11:58
  • 2
    Violating the principle of least astonishment is a much smaller offense than requiring temp `void *`'s all over the place. The `int *error` could be null when the user does not care about the reason. Actually I see no reason allocation could fail other than "out of memory" (and more importantly no reason the caller could care why it failed), so it's probably just a design mistake to begin with. – R.. GitHub STOP HELPING ICE Nov 03 '11 at 12:31
  • 1
    possible duplicate of [Why does cudaMalloc() use pointer to pointer?](http://stackoverflow.com/questions/12936986/why-does-cudamalloc-use-pointer-to-pointer) – chappjc May 23 '15 at 15:43
  • In C, do not cast the result of a call to `malloc()`, `realloc()`, or `calloc()` - it is unnecessary and potentially hides the serious error of a missing prototype. – mlp Jul 15 '19 at 16:08

5 Answers

29

All CUDA API functions return an error code (or cudaSuccess if no error occurred). Everything else is passed back by reference. However, plain C has no references, so you have to pass the address of the variable in which you want the returned information stored. Since you are returning a pointer, you need to pass a double pointer.
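
In practice the call then looks something like this (a minimal sketch using the variable names from the question; the error handling shown is just one possible way to do it):

int *device_array = 0;
// the GPU address is written back through &device_array,
// while the error code comes back as the return value
cudaError_t err = cudaMalloc((void**)&device_array, num_bytes);
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));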

Another well-known function which takes addresses for the same reason is scanf. How many times have you forgotten to write the & before the variable you want the value stored in? ;)

int i;
scanf("%d",&i);
CygnusX1
  • 20,968
  • 5
  • 65
  • 109
  • Is the double pointer in CUDA needed only because the API will dereference the pointer twice: once to get the pointer to the data stored in device memory, and again to actually access the memory contents? – rgk Jan 20 '15 at 13:57
  • 4
    It is needed because the function *sets* the pointer. As with every output parameter in C, you need a pointer to the actual variable that you set, rather than the value itself. – CygnusX1 Jan 20 '15 at 18:42
27

This is simply a horrible, horrible API design. The problem with passing double-pointers for an allocation function that obtains abstract (void *) memory is that you have to make a temporary variable of type void * to hold the result, then assign it into the real pointer of the correct type you want to use. Casting, as in (void**)&device_array, is invalid C and results in undefined behavior. You should simply write a wrapper function that behaves like normal malloc and returns a pointer, as in:

void *fixed_cudaMalloc(size_t len)
{
    void *p;
    /* cudaSuccess is the CUDA runtime's "no error" return code */
    if (cudaMalloc(&p, len) == cudaSuccess) return p;
    return 0;
}
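
With such a wrapper the call site needs neither a temporary void * nor a (void**) cast (a usage sketch; num_bytes is assumed from the question's code):

int *device_array = fixed_cudaMalloc(num_bytes);
if (device_array == NULL)
{
    // allocation failed
}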
R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • 2
    I believe CUDART has a templated wrapper for cudaMalloc() that makes the (void **) cast unnecessary. Also, the function given here isn't something that I would advise putting into production; it hides too much useful information provided by the return value of cudaMalloc(). – ArchaeaSoftware Nov 03 '11 at 11:56
  • @ArchaeaSoftware: You're free to make the signature `void *fixed_cudamalloc(size_t len, int *errcodep)` or similar so that the caller can get the "useful information". – R.. GitHub STOP HELPING ICE Dec 09 '12 at 15:48
  • 6
    To coin a phrase, that would be "simply a horrible, horrible API design." Better to have a consistent return value across the various APIs, and pass back the allocated pointer; and then you are back where we started. – ArchaeaSoftware Dec 10 '12 at 14:50
  • I disagree. A foolish consistency is the hobgoblin of little minds. Dropping consistency of interface will not hurt anybody; mistakes due to it will be automatically caught by the compiler, and the only way to get past them is to check the proper interface signature and fix your code to use it. On the other hand, mistakes due to the original flawed interface will result in the vast majority of programmers adding a cast (the universal wrong way to fix pointer type issues) and introducing undefined behavior into their programs. – R.. GitHub STOP HELPING ICE Dec 10 '12 at 16:41
  • 6
    How would you return both error code and the pointer then? Note that error handling should be left to the user of the API, so it _has_ to be returned. – CygnusX1 Feb 19 '13 at 15:57
  • That would be even worse to use! e.g. you wouldn't be able to put the CUDA call in an `if` statement. – CygnusX1 May 15 '13 at 05:50
  • 1
    @CygnusX1: Sure you could. The return value would be a null pointer on failure. The only reason to even use the `errorp` argument would be if you care to know the *reason* allocation failed, but the reason is always some variant of "not enough memory available" anyway so it's rather useless. Normally you would not want to use the reason until *inside* the body of the `if`, e.g. to print the error message. – R.. GitHub STOP HELPING ICE May 15 '13 at 13:33
  • 2
    Agreed, for this function it can work like that, but not for all functions would a NULL be an error. I strongly disagree with your statement about "foolish consistency". In the longer run, lack of consistency leads to a mess. Compiler error messages can help you catch immediate errors, but hitting those errors in the first place slows the development process and won't help you read the already existing code! Humans are limited and asking them to remember an inconsistent API won't work. Am I a "little mind" to you because of that? I find it a bit insulting. – CygnusX1 May 15 '13 at 16:54
  • @CygnusX1: I already described the reason doing allocation functions this way for the sake of consistency is harmful/dangerous: it encourages users of the function to invoke undefined behavior by performing incorrect pointer-type casts. If the allocation function allocated a pointer to a fixed type, the problem would be mitigated, but for functions like this which allocate abstract memory and need to either return `void *` or take a `void **` argument, the latter approach is dangerous and encourages bugs in the caller. – R.. GitHub STOP HELPING ICE May 15 '13 at 17:23
  • 1
    "Casting, as in (void**)&device_array, is invalid C and results in undefined behavior." -> Discussion here : [What's wrong with casting like (void**)&device_array?](http://stackoverflow.com/questions/37671262/whats-wrong-with-casting-like-voiddevice-array) – bct Jun 07 '16 at 11:41
    @R A simple test to verify your way is correct: suppose the signature of cudaMalloc takes a single pointer as input, just like malloc, and it doesn't output useful info. If this is a problem, write a wrapper which takes a double pointer and adds "useful info" to "make your life easier" -- if this is unreasonable and no one does it, then your way is likely the best. Since it is unreasonable, and no one will ever do it, then let's go ahead and say that your way is the best. – Chris Aug 31 '18 at 19:18
21

In C/C++, you can allocate a block of memory dynamically at runtime by calling the malloc function.

int * h_array;
h_array = malloc(sizeof(int));

The malloc function returns the address of the allocated memory block, which can be stored in a pointer variable.
Memory allocation in CUDA is a bit different, in two ways:

  1. cudaMalloc returns an integer error code instead of a pointer to the memory block.
  2. In addition to the number of bytes to allocate, cudaMalloc also requires a double void pointer as its first parameter.

    int *d_array;
    cudaMalloc((void **) &d_array, sizeof(int));

The reason behind the first difference is that all CUDA API functions follow the convention of returning an integer error code. To keep things consistent, the cudaMalloc API also returns an integer.

The requirement for a double pointer as the function's first argument can be understood in two steps.

Firstly, since we have already decided to make cudaMalloc return an integer value, we can no longer use its return value to communicate the address of the allocated memory. In C, the only other way for a function to communicate a result is by passing a pointer (an address) into the function. The function can then change the value stored at that address, or at the address the pointer points to, and those changes can later be read outside the function's scope through the same memory address.

How the double pointer works

The following diagram illustrates how it works with the double pointer.

int cudamalloc(void **d_array, int type_size) {
  *d_array = malloc(type_size);  /* writes the new address into the caller's pointer */
  return return_code;
}

[Diagram: &d_array is passed in, so the assignment *d_array = ... updates the caller's d_array]
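
The caller side of that sketch would look roughly like this (using the hypothetical cudamalloc from the snippet above, not the real API):

int *d_array = 0;
cudamalloc((void **) &d_array, sizeof(int));
// d_array now holds the address written by the function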

Why do we need the double pointer? Why doesn't this work?

I normally live in the Python world, so I also struggled to understand why this will not work.

int cudamalloc(void *d_array, int type_size) {
  d_array = malloc(type_size);  /* only changes the local copy of d_array */
  ...
  return error_status;
}

So why doesn't it work? Because in C, when cudamalloc is called, a local variable named d_array is created and initialized with the value of the argument passed in. There is no way to retrieve the value of that local variable outside the function's scope. That is why we need a pointer to a pointer here.

int cudamalloc(void **d_array, int type_size) {
  *d_array = malloc(type_size);  /* updates the pointer whose address the caller passed in */
  ...
  return return_code;
}


afp_2008
  • 1,940
  • 1
  • 19
  • 46
Louis T
  • 605
  • 6
  • 15
11

We pass a double pointer because it is a pointer to a pointer: it has to point to the pointer that will hold the GPU memory address. What cudaMalloc() does is allocate a block of memory on the GPU and then store its address in the pointer that the first argument points to.
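
As a minimal sketch of what that means in code (variable names are assumed, not from the answer):

int *device_array = 0;  // host-side variable that will hold a GPU address
// &device_array is a pointer to that pointer; cudaMalloc writes the GPU address through it
cudaMalloc((void **) &device_array, 256 * sizeof(int));
// device_array now points to memory on the GPU (only dereference it in device code)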

jwdmsd
  • 2,107
  • 2
  • 16
  • 30
  • 3
    Your answer explains in simple words when the ** is needed for the first parameter of cudaMalloc. Using your answer I was able to understand the logic behind the double pointer API design. The first dereference will point me to a pointer that actually points to the data stored in device memory. A second dereference will point me to the vector of interest. – rgk Jan 20 '15 at 13:39
1

The problem: you have to return two values: the return code AND the pointer to memory (in case the return code indicates success). So you must pass one of them back through a parameter, i.e. through a pointer. You have the choice between passing back an int (the error code) through an int pointer, or passing back the memory address through a pointer to a pointer. One solution is as good as the other (and the latter yields the pointer to pointer; I prefer that term over "double pointer", which sounds more like a pointer to a double-precision floating point number).
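
The two shapes being chosen between look roughly like this (the first is the actual CUDA declaration, the second is a hypothetical alternative shown only for comparison):

cudaError_t cudaMalloc(void **devPtr, size_t size);      // error code returned, address passed back
void *alternative_alloc(size_t size, int *error_code);   // address returned, error code passed back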

With malloc you have the nice property that a null pointer can indicate an error, so you basically need just one return value. I am not sure whether this is possible with a pointer to device memory, as there might be no valid null value, or a different one (remember: this is CUDA, NOT ANSI C). The null pointer on the host system could be entirely different from the null used on the device, in which case returning a null pointer to indicate errors would not work, and you would have to design the API this way (that would also mean there is NO common NULL on both devices).

flolo
  • 15,148
  • 4
  • 32
  • 57