
From looking at the CUDA 5.5 API Reference and the CUDA C Programming Guide, it seems that there is no cudaCalloc(), an on-GPU equivalent of the standard C library's calloc().

  • Is there really no API functionality for allocating a buffer initialized to all-zeros?
  • Is there something better I can do than call cudaMalloc() and then cudaMemset()?
einpoklum
  • A compiler library probably implements calloc as a wrapper around malloc + memset anyhow. – Lundin Jan 20 '14 at 12:21
  • @Lundin: I would say probably not. You can allocate zeroed memory blocks without actually setting any data, just by marking the blocks empty (e.g. not mapped from virtual to physical memory, so that writing to them causes a page fault). Of course this depends on what machine you're on. I'm not sure how memory zeroing happens on GPUs, but it may well be the case that you can do better than malloc+memset. – einpoklum Jan 20 '14 at 12:52
  • Use `thrust::device_vector`. – Jared Hoberock Jan 21 '14 at 01:43
  • @JaredHoberock: Can you make that an answer and explain why using `thrust::device_vector` is a good idea in this context? – einpoklum Jan 21 '14 at 09:36
  • Does any of the answers below satisfy you? If so, please accept it. You have the bad habit of throwing a stone and hiding your hand. – Vitality Jan 30 '14 at 13:35
  • @JackOLantern: I don't really like the macro in RobertCrovella's answer, but the answer is basically 'No', and those are the API calls for the workaround, so... – einpoklum Jan 30 '14 at 15:30

4 Answers


Is there really no API functionality for allocating a buffer initialized to all-zeros?

There really is not.

Is there something better I can do than cudaMalloc() followed by cudaMemset()?

You could use a macro, if it's a matter of convenience (you haven't told us what you mean by better, given that the answer to the first question is no):

#define cudaCalloc(A, B, C) \
    do { \
        cudaError_t cudaCalloc_err = cudaMalloc(A, (B)*(C)); \
        if (cudaCalloc_err == cudaSuccess) cudaMemset(*(A), 0, (B)*(C)); \
    } while (0)

The above macro will work with the kind of error checking I usually do (which is based on using cudaGetLastError()); or you can build your preferred error checking directly into the macro, if you like. See this question about error handling.
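As a rough usage sketch (the pointer name and sizes here are made up for illustration; it assumes the macro above, with any failure picked up afterwards via cudaGetLastError()):

float *d_arr;
cudaCalloc(&d_arr, 100, sizeof(float));  /* allocate and zero 100 floats */
cudaError_t err = cudaGetLastError();    /* a failed runtime call sets the last error */
if (err != cudaSuccess) { /* handle the error */ }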

Robert Crovella
  • Revisiting your answer - why a macro rather than a function marked `inline`? – einpoklum May 22 '17 at 10:41
  • You could do that instead. – Robert Crovella May 22 '17 at 13:58
  • I was thinking perhaps, in our more civilized age (well, for programming anyway), it would be better to edit your answer to recommend that instead. Macros should not be encouraged unless absolutely necessary IMO. By the way, Nikolay Sakharnykh says hi. Or rather, I mentioned your name and he acknowledged it :-) – einpoklum May 22 '17 at 17:41
  • why don't you add an answer? Then you'll get the credit for being more civilized. I would upvote it. You can even un-accept this one and accept your own. – Robert Crovella May 22 '17 at 17:56

If all you want is a simple way to zero out new allocations, you can use thrust::device_vector, which default-constructs its elements. For primitive types, this is the same behavior as calloc.
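As a rough sketch of what that looks like (the names here are made up; it assumes Thrust, which ships with the CUDA toolkit, is available):

#include <thrust/device_vector.h>

int main()
{
    // 1024 floats on the device, value-initialized to zero
    thrust::device_vector<float> vec(1024);

    // A raw pointer can still be extracted for use with plain CUDA API calls or kernels
    float *raw = thrust::raw_pointer_cast(vec.data());
    (void) raw;  // unused in this sketch
    return 0;
}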

Jared Hoberock
  • Thrust functions should only be used when you plan to use Thrust vectors throughout your program. The Thrust library's functionality is still limited today compared to what raw device pointers allow. – Dimitri Lesnoff Feb 02 '23 at 16:01

Here is a solution with an inline function. devPtr is supposed to be the address of a pointer to anything. Declaring the parameter as void* (rather than void**) spares the caller from having to apply a cast.

#include <cuda_runtime.h>

/* Allocates size bytes on the device and zeroes them. devPtr is the
   address of a pointer of any type, taken as void* so that callers
   need not cast (see the comments below). */
inline cudaError_t
cuda_calloc( void *devPtr, size_t size )
{
  cudaError_t err = cudaMalloc( (void**)devPtr, size );
  if( err == cudaSuccess ) err = cudaMemset( *(void**)devPtr, 0, size );
  return err;
}
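A possible call site (the pointer name and size are made up for illustration):

float *d_data = NULL;
cudaError_t err = cuda_calloc( &d_data, 1024 * sizeof(float) );
if( err != cudaSuccess ) { /* handle the error */ }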
Claas Bontus
  • Already suggested. – einpoklum Mar 01 '22 at 07:53
  • Suggested but not explicitly written. All SO posts that are CUDA related still write macros rather than inline. Unless C compatibility is required, there is no reason to keep examples of this dangerous habit. – Dimitri Lesnoff Feb 02 '23 at 15:02
  • My edit was approved and then rolled back immediately after, without fixing the issues in this answer. It is not the fact that we use a void* that releases the caller from applying a cast; rather, no cast is needed because a void* function parameter accepts a pointer of any type at the call site. snake_case is used here, while all other CUDA functions use camelCase. "Supposed to be" is quite an imprecise formulation. The code was heavily inspired by another answer and did not give any credit either. And devPtr is not a pointer to another pointer. – Dimitri Lesnoff Feb 13 '23 at 11:40
  • `cudaMalloc` modifies its first argument. It therefore expects a pointer to a pointer variable. We want to call the function like `cuda_calloc(&p, ...)` where `p` can be a pointer to int or to float or to anything. If `devPtr` were declared as `void**`, calling the function would require a cast. See the link in the answer. What is imprecise about "supposed to be"? – Claas Bontus Feb 13 '23 at 14:52

There is no calloc()-like functionality in the CUDA Runtime API, nor any other, lower-level equivalent. Instead, you can do the following:

cudaMalloc(&ptr, size);    // allocate size bytes on the device
cudaMemset(ptr, 0, size);  // zero the allocation

Note that this is all synchronous. There's a cudaMemsetAsync() as well, although, frankly, cudaMalloc() calls are currently slow enough that it doesn't really matter.
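If you do want the zeroing enqueued on a stream rather than done synchronously, a rough sketch continuing from the snippet above (the stream here is made up for illustration) would be:

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMalloc(&ptr, size);                 // the allocation itself is still synchronous
cudaMemsetAsync(ptr, 0, size, stream);  // the zeroing is enqueued on the stream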

TripleS
  • Using Memcpy to zero a buffer is a rather bad idea, I think. – einpoklum Jan 20 '14 at 14:27
  • `cudaMemset()` runs asynchronously with the host anyway (see the [ref manual](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge07c97b96efd09abaeb3ca3b5f8da4ee)). – Tom Jan 20 '14 at 23:05
  • Not true, the function exhibits synchronous behavior in most cases. – TripleS Jan 21 '14 at 05:06