According to this question and the NVIDIA CUDA Programming Guide, the `realloc` function is not implemented:
> The CUDA in-kernel `malloc()` function allocates at least `size` bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary.
>
> The CUDA in-kernel `free()` function deallocates the memory pointed to by `ptr`, which must have been returned by a previous call to `malloc()`. If `ptr` is NULL, the call to `free()` is ignored. Repeated calls to `free()` with the same `ptr` has undefined behavior.
I am currently stuck on a portion of the GMP library (or, more strictly, my attempt to port it to CUDA) which relies on this functionality:
```c
__host__ __device__ static void * /* generate this function for both CPU and GPU */
gmp_default_realloc (void *old, size_t old_size, size_t new_size)
{
  mp_ptr p;
#if __CUDA_ARCH__ /* this directive separates device and host code */
  /* ? */
#else
  p = (mp_ptr) realloc (old, new_size); /* host code has realloc from glibc */
#endif
  if (!p)
    gmp_die ("gmp_default_realloc: Virtual memory exhausted.");
  return p;
}
```
Essentially, I could simply call `malloc` with `new_size`, then `memcpy` (or perhaps `memmove`), then `free` the previous block, but this forces the data (large arrays) to be moved, which I would like to avoid.
Is there any efficient way to implement a (standard C or C++) `realloc` function inside a kernel? Let's say I have a large array of dynamically allocated data (already allocated by `malloc`), and somewhere else `realloc` is invoked to request a larger amount of memory for that block. In short, I would like to avoid copying the whole data array to a new location, and I am asking specifically how to do that (if it is possible at all).
I am not especially familiar with the PTX ISA or the underlying implementation of the in-kernel heap functions, but maybe it is worth looking in that direction?