
I'm involved in an effort to integrate CUDA into some existing software. The software I'm integrating into is pseudo real-time, so it has a memory manager library that manually hands out pointers from a single large memory allocation made up front. CUDA's Unified Memory is attractive to us because, in theory, we could change this large memory chunk to Unified Memory, keep the existing CPU code working, and add GPU kernels with very few changes to the existing data I/O stream.

Parts of our existing CPU processing code require memory to be aligned to specific boundaries. cudaMallocManaged() does not allow me to specify the alignment, and I feel like having to copy between "managed" and strictly-CPU buffers for these CPU sections almost defeats the purpose of UM. Is there a known way to address this issue that I'm missing?

I found this link on Stack Overflow that seems to solve the problem in theory, but I've been unable to get good results with the method. Using CUDA 9.1 on a Tesla M40 (24 GB):

#include <stdio.h>
#include <malloc.h>
#include <cuda_runtime.h>

#define USE_HOST_REGISTER 1

int main (int argc, char **argv)
{
   int num_float = 10;
   int num_bytes = num_float * sizeof(float);

   float *f_data = NULL;

   #if (USE_HOST_REGISTER > 0)
   printf(
      "%s: Using memalign + cudaHostRegister..\n",
       argv[0]);

   /* 32-byte-aligned host allocation, registered with the CUDA runtime below */
   f_data = (float *) memalign(32, num_bytes);

   cudaHostRegister(
      (void *) f_data,
      num_bytes,
      cudaHostRegisterDefault);
   #else
   printf(
      "%s: Using cudaMallocManaged..\n",
       argv[0]);

   /* managed (unified) memory allocation */
   cudaMallocManaged(
      (void **) &f_data,
      num_bytes);
   #endif

   /* ask the runtime whether it considers this pointer managed */
   struct cudaPointerAttributes att;
   cudaPointerGetAttributes(
      &att,
      f_data);

   printf(
      "%s: ptr is managed: %i\n",
       argv[0],
       att.isManaged);
   fflush(stdout);

   return 0;
}

When using memalign() + cudaHostRegister() (USE_HOST_REGISTER == 1), the last print statement prints 0. Unsurprisingly, device accesses via kernel launches in our larger code report illegal accesses.

When using cudaMallocManaged() (USE_HOST_REGISTER == 0), the last print statement prints 1 as expected.

edit: cudaHostRegister() and cudaMallocManaged() do return successful error codes for me. I left this error checking out of the sample I shared, but I did check them during my initial integration work. I just added the checks back in, and both calls still return cudaSuccess.
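
For reference, the checking looks roughly like this (a minimal sketch; CUDA_CHECK is just a helper macro used here for illustration, not a CUDA API):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* illustrative helper macro: abort with a message if a runtime call fails */
#define CUDA_CHECK(call)                                              \
   do {                                                               \
      cudaError_t err_ = (call);                                      \
      if (err_ != cudaSuccess) {                                      \
         fprintf(stderr, "%s:%d: %s\n",                               \
                 __FILE__, __LINE__, cudaGetErrorString(err_));       \
         exit(EXIT_FAILURE);                                          \
      }                                                               \
   } while (0)

int main (void)
{
   size_t num_bytes = 10 * sizeof(float);
   float *f_data = NULL;

   /* each runtime call is checked the same way */
   CUDA_CHECK(cudaMallocManaged((void **) &f_data, num_bytes));
   CUDA_CHECK(cudaFree(f_data));

   return 0;
}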

Thanks for your insights and suggestions.

    there is no way to convert an existing allocation to a managed allocation. I don't really understand your concerns around managed memory buffer alignment, but the base address of a managed memory allocation should be aligned at least to a 128 byte boundary. – Robert Crovella May 01 '19 at 03:51
  • Robert, thanks for your input. As far as clarifying my concerns, some of my CPU code that would be operating on memory now allocated with cudaMallocManaged() instead of memalign() had some non-standard memory alignments ("32" was just a simple example in the code I linked). I wanted a way to have this CPU code still work as-is with cudaMallocManaged() memory, and thought via the SO link I shared this was possible with cudaHostRegister(). – mfeuling May 01 '19 at 15:15
    `cudaHostRegister` doesn't have anything to do with `cudaMallocManaged` or managed memory. It is used to pin host memory, which is not the same thing as managed memory. – Robert Crovella May 01 '19 at 18:50
    [I would expect](https://stackoverflow.com/questions/14082964/cuda-alignment-256bytes-seriously) the address returned by `cudaMallocManaged` to satisfy alignment requirements of 1, 2, 4, 8, 16, 32, 64, and 128 bytes. Probably also 256 and 512 bytes. – Robert Crovella May 02 '19 at 18:56

1 Answer


There is no method currently available in CUDA to take an existing host memory allocation and convert it into a managed memory allocation.
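
If the underlying concern is only the alignment of the pointers your memory manager hands out, note (per the comments above) that the base address returned by cudaMallocManaged() is already aligned to at least 128 bytes, and likely 256 or 512 in practice. For stricter or unusual alignments, one possible workaround is to over-allocate the managed block and align sub-pointers within it yourself. A minimal sketch (align_up is an illustrative helper, not a CUDA API):

#include <stdio.h>
#include <stdint.h>
#include <cuda_runtime.h>

/* illustrative helper: round a pointer up to the next multiple of
   'alignment' (alignment must be a power of two) */
static void *align_up (void *p, size_t alignment)
{
   uintptr_t v = (uintptr_t) p;
   return (void *) ((v + alignment - 1) & ~(uintptr_t)(alignment - 1));
}

int main (void)
{
   size_t alignment = 32;
   size_t num_bytes = 10 * sizeof(float);

   /* over-allocate so an aligned sub-pointer always fits */
   float *base = NULL;
   cudaMallocManaged((void **) &base, num_bytes + alignment);

   float *f_data = (float *) align_up(base, alignment);

   printf("base = %p, aligned = %p\n", (void *) base, (void *) f_data);

   /* free the original base pointer, not the aligned sub-pointer */
   cudaFree(base);
   return 0;
}

The aligned pointer is only used for access; the original base pointer is the one that must eventually be passed to cudaFree().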

Robert Crovella