
I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:

int aPayloads[32];
// Compute aPayloads start values here

int* aGlobalPayloads = nullptr;
cudaMalloc(&aGlobalPayloads, sizeof(int) * 32);
cudaMemcpyAsync(aGlobalPayloads, aPayloads, sizeof(int) * 32, cudaMemcpyDeviceToDevice);

mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();

// Access results in payload array here

Assuming that I do things right so far, what is the fastest way to access the results in aGlobalPayloads after kernel execution? (I tried cudaMemcpy() to copy aGlobalPayloads back to aPayloads but cudaMemcpy() is not allowed in device code).
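For context, here is the snippet above assembled into a minimal, compilable sketch. The `parentkernel` wrapper, the body of `mykernel`, and the initialization loop are my own assumptions for illustration; only the array names and the launch come from the question. Dynamic parallelism requires compiling with relocatable device code, e.g. `nvcc -rdc=true parent.cu -lcudadevrt`. (Note also that device-side `cudaDeviceSynchronize()` has since been deprecated in newer CUDA versions.)

```cuda
#include <cstdio>

__global__ void mykernel(int* aGlobalPayloads)
{
    // Child kernel: modifies data in aGlobalPayloads
    aGlobalPayloads[threadIdx.x] *= 2;
}

__global__ void parentkernel()
{
    int aPayloads[32];
    for (int i = 0; i < 32; i++) aPayloads[i] = i; // compute start values here

    int* aGlobalPayloads = nullptr;
    cudaMalloc(&aGlobalPayloads, sizeof(int) * 32);
    // NOTE: copying from a local array with device-side cudaMemcpyAsync
    // is the step the question runs into trouble with
    cudaMemcpyAsync(aGlobalPayloads, aPayloads, sizeof(int) * 32,
                    cudaMemcpyDeviceToDevice);

    mykernel<<<1, 32>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
    cudaDeviceSynchronize();

    // Access results in the payload array here
    printf("result[0] = %d\n", aGlobalPayloads[0]);
    cudaFree(aGlobalPayloads);
}
```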

talonmies
Silicomancer

1 Answer

  1. You can directly access the data in aGlobalPayloads from your parent kernel code, without any copying:

    mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
    cudaDeviceSynchronize();
    int myval = aGlobalPayloads[0];
    
  2. I'd encourage careful error checking (read the whole accepted answer here). You do it in device code the same way as in host code. Note that the programming guide states, for device-side cudaMemcpyAsync(): "May not pass in local or shared memory pointers". Your use of aPayloads as the copy source is exactly such a local memory pointer.

  3. If for some reason you want that data to be explicitly put back in your local array, you can use in-kernel memcpy for that:

    memcpy(aPayloads, aGlobalPayloads, sizeof(int)*32);
    int myval = aPayloads[0]; // retrieves the same value
    

    (that is also how I would fix the issue I mention in item 2 - use in-kernel memcpy)
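As a sketch of what the error checking from item 2 might look like in device code: the `checkError` helper name below is my own invention, but the runtime calls (`cudaGetErrorString`, `cudaGetLastError`) are the same ones available in host code. Whether the local-pointer misuse actually surfaces as a returned error code or as undefined behavior is not guaranteed, so treat this as illustrative.

```cuda
#include <cstdio>

// Hypothetical helper: device-side error checking works the same way
// as on the host, using cudaError_t and cudaGetErrorString()
__device__ void checkError(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        printf("%s failed: %s\n", what, cudaGetErrorString(err));
    }
}

__global__ void mykernel(int* aGlobalPayloads)
{
    aGlobalPayloads[threadIdx.x] *= 2; // child kernel work
}

__global__ void parentkernel(int* aGlobalPayloads)
{
    int aPayloads[32];

    // aPayloads is a local memory pointer, which the programming guide
    // disallows for device-side cudaMemcpyAsync - check the result
    checkError(cudaMemcpyAsync(aGlobalPayloads, aPayloads,
                               sizeof(int) * 32,
                               cudaMemcpyDeviceToDevice), "memcpy");

    mykernel<<<1, 32>>>(aGlobalPayloads);
    checkError(cudaGetLastError(), "child launch");       // launch errors
    checkError(cudaDeviceSynchronize(), "child sync");    // execution errors
}
```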

Robert Crovella
  • Thanks a lot! In the actual code of course I will do full error checking. Should I expect a performance difference between accessing aGlobalPayloads int-by-int vs copying them to local memory en bloc and accessing aPayloads int-by-int? (I would expect a burst transfer to be more efficient but I don't know much about device-internal memory transfers) – Silicomancer Feb 16 '22 at 11:00
  • I would expect the access to `aGlobalPayloads` to be faster than copying it to local memory first, but really that's just a guess. If this is very important to you, then you may wish to do some micro-benchmarking. The copy operation is not a "burst transfer" whatever that means. It is a loop, just like you would write, that copies things byte-by-byte. Direct access should be quicker than that byte-by-byte copy, followed by local access. – Robert Crovella Feb 16 '22 at 14:17
  • Well, burst transfer or burst mode describes the situation where a data transfer's duration consists of a per-transfer penalty plus a penalty per byte. Such transfers tend to be much faster when transferring bigger portions of data. A lot of architectures and protocols have burst transfers: https://en.wikipedia.org/wiki/Burst_mode_(computing) – Silicomancer Feb 16 '22 at 16:12