4

According to this question and the NVIDIA CUDA Programming Guide, the realloc function is not implemented in-kernel:

The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary.

The CUDA in-kernel free() function deallocates the memory pointed to by ptr, which must have been returned by a previous call to malloc(). If ptr is NULL, the call to free() is ignored. Repeated calls to free() with the same ptr have undefined behavior.

I am currently stuck with some portion of the GMP library (or more strictly, my attempt to port it to CUDA), which relies on this functionality:

__host__ __device__ static void * // generate this function for both CPU and GPU
gmp_default_realloc (void *old, size_t old_size, size_t new_size)
{
    mp_ptr p;

#if __CUDA_ARCH__ // this directive separates device and host code
    /* ? */
#else
    p = (mp_ptr) realloc (old, new_size); /* host code has realloc from glibc */
#endif

    if (!p)
        gmp_die("gmp_default_realloc: Virtual memory exhausted.");

    return p;
}

Essentially I can simply call malloc with new_size, then memcpy (or maybe memmove), then free the previous block, but this forces a copy of the data (large arrays), which I would like to avoid.
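To be concrete, the fallback I mean looks like this (written as plain C for clarity; in the kernel it would be marked `__device__` and use the in-kernel `malloc`/`free`, and the caller has to track `old_size` since the device heap exposes no block sizes; `naive_realloc` is just my illustrative name):

```c
#include <stdlib.h>
#include <string.h>

/* Naive realloc built from malloc/memcpy/free: it always moves the
   data, which is exactly the copy I would like to avoid. */
static void *
naive_realloc (void *old, size_t old_size, size_t new_size)
{
    void *p = malloc (new_size);
    if (!p)
        return NULL;            /* like realloc, old block stays valid */
    if (old)
    {
        /* copy the smaller of the two sizes, then release the old block */
        memcpy (p, old, old_size < new_size ? old_size : new_size);
        free (old);
    }
    return p;
}
```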

Is there any efficient way to implement a (standard C or C++) realloc function inside a kernel? Let's say that I have some large array of dynamically allocated data (already allocated by malloc), and then in some other place realloc is invoked to request a larger amount of memory for that block. In short, I would like to avoid copying the whole data array to a new location, and I ask specifically how to do that (if it's possible at all).

I am not especially familiar with the PTX ISA or the underlying implementation of the in-kernel heap functions, but maybe it's worth looking in that direction?

Grzegorz Szpetkowski
  • `realloc` will [copy data in some cases](http://stackoverflow.com/questions/3476448/how-does-realloc-know-how-much-to-copy). If your question is how do I implement `realloc` (anywhere) without a data copy, for the general case, I don't think it can be done. What is your question, exactly? The word **effective** doesn't really tell me. Stated another way, your question title is this: "Implementing realloc in CUDA without moving data" I would ask Can you do that on the host? Because `realloc` doesn't guarantee that. – Robert Crovella Aug 21 '14 at 21:40
  • Exactly, I mean the scenario where I already have some large block of data (more precisely an array of e.g. `unsigned long` objects), and then `realloc` is used to obtain more memory. It's simply the case for arbitrary-precision numbers, where one cannot determine in advance how much memory is needed. I know that the C99/C11 standards do not guarantee that data is preserved in place, but generally it's mostly the case. – Grzegorz Szpetkowski Aug 21 '14 at 21:45
  • 2
    If you ask for larger memory, `realloc` will often have to do a data copy. I think this claim is doubtful: "I know that C99/C11 standards do not guarantee that data is preserved, but generally it's mostly the case". Even if it's true, not *all* cases can be handled (even in host code) without the need for a data copy. Therefore I doubt your question is possible (implement realloc without a data copy) whether you are talking host or GPU. – Robert Crovella Aug 21 '14 at 21:55
  • 1
    I'll say it again: Ask this question on the c or c++ tag: "how to implement realloc without moving data?" Whatever answer you come up with there will likely be instructive for creating a GPU version. I don't think anyone will be able to give you an answer there (either) but I could be wrong. – Robert Crovella Aug 21 '14 at 21:57
  • 1
    I think you are right on that point. Blocks allocated by `malloc`, `calloc` or `realloc` have to be *contiguous*, and nothing really *guarantees* that a larger block will "fit" into the available free space (this affects both host and device memory in the same way). – Grzegorz Szpetkowski Aug 21 '14 at 22:03
  • There is a reason most memory management APIs designed after the C runtime often do not include realloc()-like functionality. The function interface and failure semantics alone should give pause to developers. "If the function fails to allocate the requested block of memory, a null pointer is returned, and the memory block pointed to by argument ptr is not deallocated (it is still valid, and with its contents unchanged)." i.e. if it fails you don't know whether the old allocation is still valid! http://www.cplusplus.com/reference/cstdlib/realloc/ – ArchaeaSoftware Aug 22 '14 at 17:36
  • @ArchaeaSoftware: I see no drawback here. If memory cannot be allocated, then `realloc` returns a null pointer (C11 7.22.3.4/4) and the old memory is **not** deallocated and its value is **unchanged** (C11 7.22.3.4/3). It's the programmer's responsibility to check the returned value (just like for `malloc`). – Grzegorz Szpetkowski Aug 22 '14 at 19:08

1 Answer

2

Most malloc implementations over-allocate; this is the reason why realloc can sometimes avoid copying bytes - the old block may be large enough for the new size. But apparently in your environment the system malloc doesn't do that, so I think your only option is to reimplement all 3 primitives, gmp_default_{alloc,realloc,free}, on top of the system-provided malloc/free.

There are many open-source malloc implementations out there; glibc has one you might be able to adapt.

I'm not familiar with CUDA or GMP, but off the top of my head:

  • gmp_malloc() followed by a plain free() probably happens to work on "normal" platforms, but will likely cause heap corruption if you go ahead with this

  • if all you want is a more efficient realloc, you can simply over-allocate in your custom malloc (up to some size, say the nearest power of 2), just so you can avoid copying in the subsequent realloc. You don't even need a full-blown heap implementation for that.

  • your implementation may need to use a mutex or some such to protect your heap against concurrent modifications

  • you can improve performance even more if you never (or infrequently) return the malloc()ed blocks back to the OS from within your custom heap, i.e. keep the gmp_free()ed blocks around for subsequent reuse instead of calling the system free() on them immediately

  • come to think of it, a better idea would be to introduce a sane malloc implementation into that platform, outside of your GMP lib, so that other programs and libraries could draw their memory from the same pool, instead of GMP doing one thing and everything else doing something else. This should help with overall memory consumption w.r.t. the previous point. Maybe you should port glibc first :)
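To illustrate the over-allocation idea from the second bullet, here is one possible sketch (the `my_*` names are hypothetical, not GMP's): round every request up to a power-of-two capacity and stash that capacity in a small header just before the user pointer, so a growing realloc that still fits within the reserved capacity needs no move at all. This is written against the host malloc/free for testability; on the device you'd substitute the in-kernel allocator, pad the header to preserve 16-byte alignment, and add the locking mentioned in the third bullet.

```c
#include <stdlib.h>
#include <string.h>

/* Round up to the next power of two (the capacity we actually reserve). */
static size_t
round_pow2 (size_t n)
{
    size_t c = 16;              /* minimum capacity */
    while (c < n)
        c <<= 1;
    return c;
}

/* Each block stores its capacity in a header just before the user data.
   (A real implementation would pad this header to keep user data
   suitably aligned.) */
static void *
my_malloc (size_t size)
{
    size_t cap = round_pow2 (size);
    size_t *p = malloc (sizeof (size_t) + cap);
    if (!p)
        return NULL;
    *p = cap;
    return p + 1;               /* user pointer is just past the header */
}

static void
my_free (void *ptr)
{
    if (ptr)
        free ((size_t *) ptr - 1);
}

/* Growing within the reserved capacity requires no copy at all;
   only growth beyond it falls back to malloc + memcpy + free. */
static void *
my_realloc (void *old, size_t old_size, size_t new_size)
{
    size_t *hdr;
    void *p;

    if (!old)
        return my_malloc (new_size);
    hdr = (size_t *) old - 1;
    if (new_size <= *hdr)       /* still fits: no move, no copy */
        return old;
    p = my_malloc (new_size);
    if (!p)
        return NULL;            /* old block stays valid, like realloc */
    memcpy (p, old, old_size < new_size ? old_size : new_size);
    my_free (old);
    return p;
}
```

Because arbitrary-precision numbers tend to grow a limb at a time, most `my_realloc` calls will hit the "still fits" path and return the same pointer, which is exactly the behavior the question is after.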

davlet