I am profiling the example code from the post "Unified Memory for CUDA Beginners" on the NVIDIA developer blog.

Code:

#include <iostream>
#include <math.h>

// CUDA kernel to add elements of two arrays
__global__
void add(int n, float* x, float* y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1 << 20;
    float* x, * y;

    // Allocate Unified Memory -- accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}

QUESTION: The profiling results presented by the author show information about "Page Faults", but when I run the nvprof and nvvp profilers, I do not get any information about page faults. Is there a flag or something that needs to be set explicitly to get that information?

My nvprof output:

==20160== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  60.513us         1  60.513us  60.513us  60.513us  add(int, float*, float*)
      API calls:   81.81%  348.14ms         2  174.07ms  1.5933ms  346.54ms  cudaMallocManaged
                   16.10%  68.511ms         1  68.511ms  68.511ms  68.511ms  cuDevicePrimaryCtxRelease
                    1.34%  5.7002ms         1  5.7002ms  5.7002ms  5.7002ms  cudaLaunchKernel
                    0.66%  2.8192ms         2  1.4096ms  1.0669ms  1.7523ms  cudaFree
                    0.07%  277.80us         1  277.80us  277.80us  277.80us  cudaDeviceSynchronize
                    0.01%  33.500us         3  11.166us  3.5000us  16.400us  cuModuleUnload
                    0.00%  19.800us         1  19.800us  19.800us  19.800us  cuDeviceTotalMem
                    0.00%  16.700us       101     165ns     100ns     900ns  cuDeviceGetAttribute
                    0.00%  9.2000us         3  3.0660us     200ns  8.2000us  cuDeviceGetCount
                    0.00%  3.1000us         1  3.1000us  3.1000us  3.1000us  cuDeviceGetName
                    0.00%  2.1000us         2  1.0500us     300ns  1.8000us  cuDeviceGet
                    0.00%     300ns         1     300ns     300ns     300ns  cuDeviceGetLuid
                    0.00%     200ns         1     200ns     200ns     200ns  cuDeviceGetUuid

==20160== Unified Memory profiling result:
Device "GeForce GTX 1070 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      64  128.00KB  128.00KB  128.00KB  8.000000MB  3.217900ms  Host To Device
     146  84.164KB  32.000KB  1.0000MB  12.00000MB  68.17800ms  Device To Host

My nvvp Profiling Result:

[nvvp timeline screenshot omitted]

– skm

1 Answer

The operating system matters.

You are on Windows, and the CUDA Unified Memory (UM) system works quite differently on Windows than on Linux when Pascal or newer devices are involved (the relevant documentation is the Unified Memory Programming appendix of the CUDA C++ Programming Guide).

On Windows, page faults are not the mechanism that the UM system uses to determine when to migrate data, so they are not reported by the profiler.
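
As a rough mental model, the Windows regime is effectively equivalent to writing the same program with explicit copies. Below is an illustrative sketch (mine, not code from the question or the blog post): the bulk host-to-device copies before the launch, and the device-to-host copy afterwards, mirror the transfers the UM runtime performs for you on Windows at kernel launch and at synchronization, respectively:

#include <iostream>
#include <math.h>

__global__
void add(int n, float* x, float* y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1 << 20;

    // host copies
    float* x = new float[N];
    float* y = new float[N];
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // device copies, allocated explicitly instead of with cudaMallocManaged
    float* d_x, * d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));

    // explicit host-to-device copies: on Windows, the UM runtime effectively
    // performs the equivalent bulk transfer of all managed data at kernel launch
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);

    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, d_x, d_y);

    // explicit device-to-host copy: with managed memory on Windows, the
    // corresponding transfer back is triggered by cudaDeviceSynchronize()
    cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    cudaFree(d_x);
    cudaFree(d_y);
    delete[] x;
    delete[] y;
    return 0;
}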

– Robert Crovella
  • ok, then how does the physical data move in the case of Windows? The link that you have provided does not describe the data migration process followed on Windows. Furthermore, most articles and even books explain the concept of Unified Memory and data migration in terms of the page-faulting mechanism. Is there any resource where I can get more detailed information on this topic for the Windows OS? – skm Nov 29 '21 at 14:26
  • The link I provided explains it. Specifically it says: "Applications running on Windows (whether in TCC or WDDM mode) will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher. See Data Migration and Coherency for details." Which says: "GPU architectures of compute capability lower than 6.x do not support fine-grained movement of the managed data to GPU on-demand. Whenever a GPU kernel is launched all managed memory generally has to be transferred to GPU memory to avoid faulting on memory access." – Robert Crovella Nov 29 '21 at 14:33
  • That is as much explanation or specification as you will get from NVIDIA documentation, as far as I know. Effectively, all managed data is transferred to the GPU at the point of kernel launch. This regime requires a `cudaDeviceSynchronize()` after any kernel launch before managed data is again accessible to the host CPU; the transfer in the device-to-host direction is triggered by that `cudaDeviceSynchronize()` call. – Robert Crovella Nov 29 '21 at 14:35
  • Thanks, I have already read those sentences at the link you provided, but I am hoping for more information. Given that there are long blog posts and book chapters dedicated to explaining the UM process via the page-fault mechanism, I was expecting the same level of detail for the Windows OS. – skm Nov 29 '21 at 14:43
  • I've given you what I believe is a useful mental model for a programmer: 1. "Effectively, all managed data is transferred to the GPU at the point of kernel launch." 2. "The transfer in the device-to-host direction is triggered by (a) `cudaDeviceSynchronize()` call." I'm not sure what else is needed. The mechanism is far simpler than the demand-paged mechanism; it doesn't require pages and pages to describe it at the level that is useful for the CUDA programmer. – Robert Crovella Nov 29 '21 at 14:48
  • " 1. " Effectively, all managed data is transferred to the GPU at the point of kernel launch. " 2. " The transfer in the Device-to-host direction is triggered by (a) cudaDeviceSynchronize() call. "", This is more or less the same procedure/order followed by a program written using memory-copy calls (i.e. `cudaMemcpy()` then, what is exactly the benefit of UM on Windows...only to avoid the calls of `cudaMemcpy()` explicitly? – skm Nov 29 '21 at 15:01
  • Yes, it is similar to what you would do with `cudaMalloc`/`cudaMemcpy`. I think I have answered your original question here. You seem to want a tutorial on managed memory; I won't be able to do that in the comments. I can certainly demonstrate a programming case where using explicit `cudaMalloc`/`cudaMemcpy` would be extremely tedious (e.g. transfer of a doubly-linked list from host to device) and where, by comparison, `cudaMallocManaged` makes it trivially simple; a sketch of that case follows these comments. UM is not first and foremost about performance, it is about programmer productivity. – Robert Crovella Nov 29 '21 at 15:12
  • You can find a tutorial on UM [here](https://www.olcf.ornl.gov/calendar/cuda-managed-memory/) – Robert Crovella Nov 29 '21 at 15:14
  • Thanks for presenting a good use case for UM. It helped to know that `UM is not first and foremost about performance, it is about programmer productivity.` – skm Nov 29 '21 at 15:21
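
To make the productivity point concrete, below is a small illustrative sketch of the doubly-linked-list case mentioned in the comments (my own example; the `Node` struct and `sumList` kernel are hypothetical, not from the answer). Because every node lives in managed memory, the GPU can follow pointers built on the host directly, with no per-node `cudaMemcpy` and no pointer fix-up:

#include <iostream>

// A doubly-linked list node; with cudaMallocManaged, the same pointer
// values are valid on both host and device.
struct Node {
    int value;
    Node* prev;
    Node* next;
};

// Kernel that walks the list from the head and sums the values
__global__ void sumList(Node* head, int* result)
{
    int sum = 0;
    for (Node* p = head; p != nullptr; p = p->next)
        sum += p->value;
    *result = sum;
}

int main()
{
    const int count = 10;
    Node* head = nullptr;
    Node* tail = nullptr;

    // Build the list on the host, one managed allocation per node
    for (int i = 1; i <= count; i++) {
        Node* n;
        cudaMallocManaged(&n, sizeof(Node));
        n->value = i;
        n->prev = tail;
        n->next = nullptr;
        if (tail) tail->next = n; else head = n;
        tail = n;
    }

    int* result;
    cudaMallocManaged(&result, sizeof(int));

    // The device dereferences the host-built pointers directly
    sumList<<<1, 1>>>(head, result);
    cudaDeviceSynchronize();

    std::cout << "Sum: " << *result << std::endl;  // expect 55

    for (Node* p = head; p != nullptr; ) {
        Node* next = p->next;
        cudaFree(p);
        p = next;
    }
    cudaFree(result);
    return 0;
}

Doing the same with explicit `cudaMalloc`/`cudaMemcpy` would require deep-copying each node and rewriting every `prev`/`next` pointer to its device address.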