I have a program that works well on Linux (Ubuntu 22.04) with NVCC (11.5) and that uses managed memory. When I bring this code to a Windows machine with WSL2, the portion of the code that actually uses the managed memory crashes with a segmentation fault.

Below is the full code. I can allocate managed memory, but as soon as I try to use it as such (e.g. with memcpy instead of cudaMemcpy), the program crashes.

What I want to know is whether this is a limitation of WSL2 or of the way I have configured my WSL2.

I am using the Windows NVIDIA driver 516.59 on a Quadro RTX 5000 Max-Q (Mobile), Windows 11 Pro 10.0.22000 Build 22000, and nvcc V11.5.119.

I compile and run the program with nvcc -ccbin=g++-10 a.cu && ./a.out.

#include <stdio.h>

#include <algorithm>

#define N 1000
__global__
void add(int *a, int *b) {
    int i = blockIdx.x;
    if (i<N) {
        b[i] = 2*a[i];
    }
}

int main() {
    int ha[N], hb[N];

    int *da, *db;
    cudaMallocManaged((void **)&da, N*sizeof(int));
    cudaMallocManaged((void **)&db, N*sizeof(int));

    for (int i = 0; i<N; ++i) {
        ha[i] = i;
    }

    cudaMemcpy(da, ha, N*sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(da, db);

//  cudaMemcpy(hb, db, N*sizeof(int), cudaMemcpyDeviceToHost);  // this is OK in WSL2 and Linux
    memcpy(hb, db, N*sizeof(int));  // this produces a seg fault in WSL2 but not in Linux

    for (int i = 0; i<N; ++i) {
        printf("%d\n", hb[i]);
    }

    cudaFree(da);
    cudaFree(db);

    return 0;
}

More details:

$ nvidia-smi
Fri Jul 22 03:56:12 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 516.59       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   47C    P8    10W /  N/A |   3323MiB / 16384MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
• As can be found in the docs [here](https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps), WSL has the same limitations in terms of unified memory as native Windows. I would have thought that these limitations are mostly causing worse performance (see [here](https://stackoverflow.com/q/72879821/10107454)), so I'm not 100% sure there isn't some additional config problem causing the segfault. – paleonix Jul 22 '22 at 13:02
• No error checking? And as indicated above, you need a cudaDeviceSynchronize after the kernel call, before the memcpy. – Robert Crovella Jul 22 '22 at 13:02
  • Adding a `cudaDeviceSynchronize` after the kernel launch solved the problem. How do I execute `cudaDeviceSynchronize` only if needed by the platform? I am afraid that adding it when it is not necessary will make the code slower in Linux. Is `if( cuda_Something_GetVersion() ) cudaDeviceSynchronize();` a good workaround? – alfC Jul 22 '22 at 17:51
• programmatically you can check the `concurrentManagedAccess` property which is covered in the [documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-coherency-hd) (see the sketch after these comments). – Robert Crovella Jul 23 '22 at 13:56
  • I realize the code you provided is just an example, but the pattern has a hazard on linux. If you don't have any host/device synchronization after the kernel call, and immediately `memcpy` the data referenced by `db` back to the host, it is not guaranteed to reflect the updates from the kernel. – Robert Crovella Jul 23 '22 at 14:03
• I've added a testing example to the linked duplicate for concurrent (host/device) managed access. – Robert Crovella Jul 23 '22 at 14:14
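Putting the comments together, here is a minimal sketch of the question's program with the suggested check and fix applied. The `cudaDeviceGetAttribute` query for `cudaDevAttrConcurrentManagedAccess` and the cudaDeviceSynchronize call are the additions; everything else follows the code above. The synchronize is left unconditional because, as Robert Crovella notes, even on platforms that report concurrent managed access, reading the kernel's output from the host without synchronizing is a race.

#include <stdio.h>
#include <string.h>

#define N 1000

__global__
void add(int *a, int *b) {
    int i = blockIdx.x;
    if (i < N) {
        b[i] = 2*a[i];
    }
}

int main() {
    int hb[N];

    int *da, *db;
    cudaMallocManaged((void **)&da, N*sizeof(int));
    cudaMallocManaged((void **)&db, N*sizeof(int));

    // Before any kernel launch the host may touch managed memory directly,
    // even on platforms without concurrent managed access.
    for (int i = 0; i < N; ++i) {
        da[i] = i;
    }

    add<<<N, 1>>>(da, db);

    // 0 on Windows/WSL2, 1 on Linux with a supported GPU.
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    printf("concurrentManagedAccess: %d\n", concurrent);

    // Mandatory on WSL2 (host access to managed memory while a kernel may be
    // running is invalid there), and needed everywhere to guarantee the host
    // sees the kernel's results.
    cudaDeviceSynchronize();

    memcpy(hb, db, N*sizeof(int));  // now safe on both platforms

    for (int i = 0; i < N; ++i) {
        printf("%d\n", hb[i]);
    }

    cudaFree(da);
    cudaFree(db);

    return 0;
}

Synchronizing unconditionally is the simpler choice here: on WSL2 it is mandatory, and on Linux it closes the ordering hazard described in the comments, so gating it on the attribute saves nothing in this pattern.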
