
I have an object, say d_obj, that has some members in unified memory and some members explicitly in device memory. I then call a CUDA kernel that takes the object and works with it. I would like the CPU to work with the unified-memory members immediately after the kernel call, but that fails. Here is a short code that reproduces my problem:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"    
#include <stdio.h>

#define CHECK_CUDA(call)                                            \
{                                                                   \
const cudaError_t error = call;                                     \
if (error != cudaSuccess)                                           \
{                                                                   \
printf("ERROR:: File: %s, Line: %d, ", __FILE__, __LINE__);         \
printf("code: %d, reason: %s\n", error, cudaGetErrorString(error)); \
exit(EXIT_FAILURE);                                                 \
}                                                                   \
}

class MyClass
{
public:
    MyClass(int n_) : n(n_) { }
    void allocateMeOnDevice() {
        CHECK_CUDA(cudaMalloc((void**)&vec, n * sizeof(float)));
    }
    int n;
    float* vec;
};

__global__ void kernel(MyClass* obj) {
    for (int i = 0; i < obj->n; i++) {
        obj->vec[i] = 1;
    }
}

int main() {
    
    int n = 1000;

    MyClass h_obj(n); 

    MyClass* d_obj;
    CHECK_CUDA(cudaMallocManaged((void**)&d_obj, sizeof(MyClass)));
    CHECK_CUDA(cudaMemcpy(d_obj, &h_obj, sizeof(MyClass), cudaMemcpyHostToDevice));
    d_obj->allocateMeOnDevice();

    kernel<<<1, 1>>>(d_obj);

    //CHECK_CUDA(cudaDeviceSynchronize()); 
    printf("** d_obj->n is %d\n", d_obj->n); // <-- Read access violation if the above line is commented out    

}

Is it not possible to access something in unified memory from both host and device at the same time? Is there any workaround for this problem?

OS: Windows 10/ CUDA 11.2/ Device: GeForce RTX 3090

If_You_Say_So
  • 2
    On Windows it is not possible to access unified memory from host and device at the same time. The `cudaDeviceSynchronize()` call is mandatory after launching a kernel, before accessing unified memory from host code. There is no workaround that allows you to access unified memory from host and device at the same time on Windows. One possible workaround is to switch to Linux. This is covered in the unified memory section of the CUDA programming guide. – Robert Crovella Apr 15 '21 at 19:56
  • I wish there was a workaround on Windows as switching to Linux is not an option in this case. Anyway thanks for your help. At least now I know that this is a limitation I have to work with. – If_You_Say_So Apr 15 '21 at 20:28
  • 1
    Another possible workaround to access a memory resource on windows from both host and device is to use zero-copy techniques, i.e. host-pinned memory. That is used [here](https://stackoverflow.com/questions/20345702/how-can-i-check-the-progress-of-matrix-multiplication/20381924#20381924) for example, where the general matrix multiplication uses ordinary device memory, but the communication happens via pinned memory. However, doing *large-scale* operations on a pinned resource from CUDA device code is probably going to be disappointing, performance-wise. – Robert Crovella Apr 15 '21 at 21:42

1 Answer

7

Under Windows, with any recent version of CUDA (say, 9.0 or newer), unified memory (or managed memory; the terms are synonymous) behavior is documented as follows:

Applications running on Windows (whether in TCC or WDDM mode) will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.

Later, the documentation indicates that on such systems it is necessary to issue a `cudaDeviceSynchronize()` after a kernel launch before managed data can be accessed by the CPU again.
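Applied to the code in the question, the required pattern is simply to synchronize between the kernel launch and the host access. A minimal sketch (error checking omitted for brevity, and `vec` itself placed in managed memory rather than wrapped in a class):

```cuda
#include <cstdio>

__global__ void fill(float* vec, int n) {
    for (int i = 0; i < n; i++) vec[i] = 1.0f;
}

int main() {
    int n = 1000;
    float* vec;
    cudaMallocManaged(&vec, n * sizeof(float));

    fill<<<1, 1>>>(vec, n);

    // On Windows (concurrentManagedAccess == 0) this synchronization is
    // mandatory before the host may touch managed data again.
    cudaDeviceSynchronize();

    printf("vec[0] is %f\n", vec[0]);
    cudaFree(vec);
    return 0;
}
```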

If you fail to do that on Windows, you will hit seg faults in CPU code that tries to access any managed data. Programmatically, you can check whether this type of synchronization is needed after a kernel launch by querying the concurrentManagedAccess property, which is covered in the documentation. You could use cudaDeviceGetAttribute() for this:

int cmm = 0;
int device_to_check = 0;
cudaDeviceGetAttribute(&cmm, cudaDevAttrConcurrentManagedAccess, device_to_check);
if (cmm) {
    // cmm is non-zero: it is safe to access managed data without a
    // cudaDeviceSynchronize() after a kernel call
}
else {
    // cmm is zero: this is the Windows case...
}

Some possible workarounds:

  • switch to Linux (assuming your GPU is cc6.x or higher)
  • use host-pinned ("zero-copy") memory, rather than managed memory. For bulk or large-scale data access, however, this will likely have performance ramifications.

Note that WSL is also considered a Windows platform with respect to managed memory behavior.
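A minimal sketch of the second workaround, replacing managed memory with mapped host-pinned ("zero-copy") memory via cudaHostAlloc/cudaHostGetDevicePointer (error checking omitted; the kernel here is a stand-in for the code in the question):

```cuda
#include <cstdio>

__global__ void fill(float* vec, int n) {
    for (int i = 0; i < n; i++) vec[i] = 1.0f;
}

int main() {
    int n = 1000;
    float *h_vec, *d_vec;

    // Allocate pinned, mapped host memory. The device accesses it directly
    // over the bus, so the Windows managed-memory restriction does not apply.
    cudaHostAlloc(&h_vec, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_vec, h_vec, 0);

    fill<<<1, 1>>>(d_vec, n);

    // The host can read h_vec while the kernel runs (e.g. to monitor
    // progress), but values only appear as the kernel writes them;
    // synchronize before relying on the final result.
    cudaDeviceSynchronize();
    printf("h_vec[0] is %f\n", h_vec[0]);

    cudaFreeHost(h_vec);
    return 0;
}
```

As noted in the comments, doing large-scale work on a pinned resource from device code is likely to be slow; this technique is best for small communication buffers.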

Robert Crovella