
The CUDA programming guide states that "Memory allocated via malloc() can be copied using the runtime (i.e., by calling any of the copy memory functions from Device Memory)", but somehow I'm having trouble reproducing this. Code:

#include <cstdio>
#include <cstdlib>   // host-side malloc()/free()
__device__ int* p;

__global__ void allocate_p() {
  p = (int*) malloc(40);   // in-kernel heap allocation (40 bytes, matching the copy below)
  printf("p = %p  (seen by GPU)\n", p);
}

int main() {
  cudaError_t err;
  int* localp = (int*) malloc(40);   // host buffer, 40 bytes

  allocate_p<<<1,1>>>();
  cudaDeviceSynchronize();

  //Getting pointer to device-allocated memory
  int* tmpp = NULL;
  cudaMemcpyFromSymbol(&tmpp, p, sizeof(p));   // fetch the device pointer value
  printf("p = %p  (seen by CPU)\n", tmpp);

  //cudaMalloc((void**)&tmpp, 40);
  err = cudaMemcpy(tmpp, localp, 40, cudaMemcpyHostToDevice);
  cudaDeviceSynchronize();
  printf(" err:%i %s", (int)err, cudaGetErrorString(err));

  free(localp);   // allocated with malloc(), so free() rather than delete
  return 0;
}

fails with this output:

p = 0x601f920  (seen by GPU)
p = 0x601f920  (seen by CPU)
 err:11 invalid argument

I gather that the host sees the appropriate device address, but for some reason does not accept it when it comes from the in-kernel malloc().

If I instead allocate beforehand with cudaMalloc((void**)&np, 40); and pass the pointer np as an argument to the kernel allocate_p, where it is assigned to p (instead of calling malloc()), the code runs fine.
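
For reference, here is a minimal sketch of that working variant (the argument-taking allocate_p and the error check are just for illustration):

#include <cstdio>

__device__ int* p;

// Same kernel as above, but the pointer now comes from the host via cudaMalloc()
__global__ void allocate_p(int* np) {
  p = np;
  printf("p = %p  (seen by GPU)\n", p);
}

int main() {
  int* np = NULL;
  cudaMalloc((void**)&np, 40);   // host-API allocation instead of in-kernel malloc()

  allocate_p<<<1,1>>>(np);
  cudaDeviceSynchronize();

  int localp[10] = {0};
  // This copy succeeds, because np was obtained from cudaMalloc()
  cudaError_t err = cudaMemcpy(np, localp, 40, cudaMemcpyHostToDevice);
  printf("err:%i %s\n", (int)err, cudaGetErrorString(err));

  cudaFree(np);
  return 0;
}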

What am I doing wrong, and how does one use malloc()-allocated device memory in host-side functions?

P Marecki
  • The section you are quoting is from section "B.17.2 Interoperability with Host Memory API". malloc() is referring to host malloc() not device malloc(). This is definitely confusing given it follows the section "B.17.1 Heap Memory Allocation" that introduces device malloc(). – Greg Smith Sep 04 '12 at 03:20
  • @GregSmith: I can't understand how you reached that interpretation. The section is about device dynamic memory allocation (there are even code samples showing kernel malloc calls), and the reference is to host memory API functions described in Section 3.2.2, ie. cudaMemcpy. Where does *host* malloc come into the picture? I wouldn't say it is confusingly written, I would say it is just plain wrong (or there is a massive bug in the runtime as the alternative). – talonmies Sep 04 '12 at 16:25
  • @talonmies Yes, the section is wrong in older manuals and was copied from a different location. The statement should be removed from the next version of the programming guide. – Greg Smith Sep 04 '12 at 17:22

1 Answer


As far as I am aware, it isn't possible to copy runtime heap memory using the host API functions. It certainly was not possible in CUDA 4.x, and the CUDA 5.0 release candidate has not changed this. The only workaround I can offer is to use a kernel to "gather" final results and copy them into a device transfer buffer or into zero-copy memory, which can then be reached via the host API or directly from the host, respectively. You can see an example of this approach in this answer, and in another question Mark Harris from NVIDIA confirmed that this is a limitation of the (then) current implementation in the CUDA runtime.
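
A rough sketch of that gather idea (the names heap_data, gather and transfer_buf are illustrative, not from the question; it assumes the heap pointer is published through a __device__ symbol as above):

__device__ int* heap_data;            // set by an earlier kernel via in-kernel malloc()

// Copy n ints from the runtime heap into a buffer the host API is allowed to touch
__global__ void gather(int* transfer_buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    transfer_buf[i] = heap_data[i];   // device-to-device copy done by the kernel itself
}

// Host side (sketch):
//   int* buf;
//   cudaMalloc((void**)&buf, n * sizeof(int));
//   gather<<<(n + 255) / 256, 256>>>(buf, n);
//   cudaMemcpy(results, buf, n * sizeof(int), cudaMemcpyDeviceToHost);   // legal: buf came from cudaMalloc()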

talonmies
  • Another workaround would be to not use `__device__ malloc()` at all, and instead allocate the memory up front and index into it in the kernel, right? Especially if the `__device__ malloc()` is of constant size, as in the example. Also, I think that `__device__ malloc()` may cause implicit serialization of threads that make the call at the same time (so that each of them can be assigned its own chunk from the heap). – Roger Dahl Sep 03 '12 at 17:12
  • @RogerDahl: In most cases I would agree. There are some cases where it is better to use a "setup" kernel to build data structures in the heap with kernel malloc, have the device code work purely on that structure in the heap, and lastly copy the final outcome back to the host in a "pull down/gather" kernel. Deeply recursive data structures are basically impossible to allocate and initialise using only the host API. – talonmies Sep 03 '12 at 18:02
  • Thanks; this was driving me mad (given what the guide says). As I have the device addresses on the host (in `tmpp`), I tried shifting another pointer obtained from cudaMalloc() so that it points to the malloc()-allocated device memory. (This was stupid, but I thought that perhaps CUDA had some RTTI distinguishing the pointers.) Apparently the host-side allocator range-checks memory operations against what it explicitly allocated. The device-side allocator, on the other hand, is pretty lax and would not even bother reporting a `free(s)` of a non-allocated `s`. I'm testing this on CUDA 5.0 with `-arch=sm_21`. – P Marecki Sep 03 '12 at 19:09
  • @talonmies: "transfer buffer" - OK, but what do you mean by "zero copy memory"? (A buffer would eat device memory, wouldn't it?) – P Marecki Sep 03 '12 at 19:16
  • @PMarecki: Zero copy is mapped host memory which the GPU can directly access over the PCI-e bus without the need for an explicit API copy (thus the name "zero copy"); there is a rough sketch after these comments. – talonmies Sep 03 '12 at 19:21
  • @RogerDahl: Very interesting thought on the threading aspect of `malloc`. Any info on the device allocator's inner workings would be appreciated. So far I see that memory ranges allocated by `malloc` and `cudaMalloc` tend to be physically separated (and for some reason the default heap size for `malloc` is only 8 MB [expandable]). – P Marecki Sep 03 '12 at 19:21
  • @PMarecki can you please provide a specific version/section/page reference in the CUDA programming guide to the quote you provided? I want to make sure any mistakes in the documentation are fixed. Thanks! – harrism Sep 04 '12 at 01:01
  • @harrism: Section B.17.2 of the CUDA 4.2 Programming Guide, last sentence. I had never noticed it before when searching.... – talonmies Sep 04 '12 at 05:59
  • @harrism: as talonmies says; the sentence is also in 5.0 PG. – P Marecki Sep 04 '12 at 06:22
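
For completeness, a minimal sketch of the zero-copy option mentioned in the comments above (illustrative names; it assumes a device that supports mapped host memory):

int* host_buf = NULL;   // pinned host memory, also visible to the GPU
int* dev_view = NULL;   // device-side alias of the same allocation

cudaSetDeviceFlags(cudaDeviceMapHost);                        // must be called before the context is created
cudaHostAlloc((void**)&host_buf, 40, cudaHostAllocMapped);    // pinned + mapped host allocation
cudaHostGetDevicePointer((void**)&dev_view, host_buf, 0);     // get the device-visible pointer

// A "gather" kernel can write through dev_view; the host then reads host_buf
// directly, with no explicit cudaMemcpy (hence "zero copy").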