
I'd like to experiment with MPS on Nvidia GPUs, so I'd like to be able to profile two processes running in parallel. The now-deprecated nvprof had an option "--profile-all-processes". Is there an equivalent for nsys?

I tried generating multiple reports with MPS OFF and then importing them onto the same timeline, using this code (from this question):

#include <stdio.h>
#include <stdlib.h>

#define MAX_DELAY 30

#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
        fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
            msg, cudaGetErrorString(__err), \
            __FILE__, __LINE__); \
        fprintf(stderr, "*** FAILED - ABORTING\n"); \
        exit(1); \
    } \
  } while (0)


#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

// microseconds of wall-clock time elapsed since `start`
unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

#define APPRX_CLKS_PER_SEC 1000000000ULL
// busy-wait until roughly `seconds` seconds have elapsed on the SM clock
__global__ void delay_kernel(unsigned seconds){

  unsigned long long dt = clock64();
  while (clock64() < (dt + (seconds*APPRX_CLKS_PER_SEC)));
}

int main(int argc, char *argv[]){
  cudaSetDevice(0);
  unsigned delay_t = 10; // seconds, approximately
  unsigned delay_t_r = 0; // initialized so the check below is defined when no argument is given
  if (argc > 1) delay_t_r = atoi(argv[1]);
  if ((delay_t_r > 0) && (delay_t_r < MAX_DELAY)) delay_t = delay_t_r;
  unsigned long long difft = dtime_usec(0);
  for (int i = 0; i < 3;i++) {
    delay_kernel<<<1,1>>>(delay_t);
    cudaDeviceSynchronize();
  }
  cudaCheckErrors("kernel fail");
  difft = dtime_usec(difft);
  printf("kernel duration: %fs\n", difft/(float)USECPSEC);
  cudaFree(0);
  return 0;
}

And this script:

nvcc -o t1034 t1034.cu

nsys profile -o rep1 -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -f true -x true ./t1034 &
nsys profile -o rep2 -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -f true -x true ./t1034

I then open rep1.qdrep and add rep2.qdrep to it, which produces the following timeline: [Nsys timeline screenshot]

But I expected something more like this: [reference timeline]

Am I doing something wrong? Is this the correct result?

(Side note: I'm running this example inside the nvcr.io/nvidia/tensorrt:20.12-py3 Docker container.)

Blaizz

1 Answer


I guess your question is why the kernels from separate processes appear to overlap, even though MPS is off.

This is due to the behavior of the low-level GPU task/context scheduler.

It used to be that the scheduler would run one kernel/process/context/task to completion, then schedule another kernel from some waiting process/context/task. In this scenario, the profiler would depict the kernel execution without overlap.

More recently (let's say sometime after 2015, when your reference presentation was created), the GPU scheduler switched to time-slicing on various newer GPUs and newer CUDA versions. This means that at a high level, the tasks appear to be running "concurrently" from the profiler perspective, even though MPS is off. Kernel A from process 1 is not necessarily allowed to run to completion before the context scheduler halts that kernel in its tracks, does a context switch, and allows kernel B from process 2 to begin executing for a time slice.

A side effect of this for an ordinary kernel is that, due to time-slicing, the kernels that seem to be running concurrently will usually take longer to run. For your time-delay kernel(s), the time-delay mechanism is "fooled" by the time-slicing (the SM clock continues to increment while the kernel is descheduled), so they don't appear to take any longer to run even though time-sliced sharing is going on.
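
One way to make the time-slicing visible in the measured duration (a sketch, not part of the original question) is to replace the clock-watching loop with a fixed amount of arithmetic work, so that being descheduled actually lengthens the wall-clock time. The kernel name and the iteration count below are illustrative and would need tuning for a given GPU:

// Sketch: a delay kernel that performs a fixed amount of integer work instead of
// polling clock64(). Under time-sliced sharing its wall-clock duration should
// visibly lengthen, unlike the clock-based kernel above.
#include <cstdio>
#include <chrono>

__global__ void work_kernel(unsigned long long iters, unsigned long long *out){
  unsigned long long acc = 0;
  for (unsigned long long i = 0; i < iters; i++)
    acc += i * 2654435761ULL;   // cheap work; the result is stored so the loop isn't optimized away
  *out = acc;
}

int main(){
  unsigned long long *d_out;
  cudaMalloc(&d_out, sizeof(*d_out));
  auto t0 = std::chrono::steady_clock::now();
  work_kernel<<<1,1>>>(1ULL << 32, d_out);   // iteration count is an illustrative guess; tune per GPU
  cudaDeviceSynchronize();
  auto t1 = std::chrono::steady_clock::now();
  printf("kernel duration: %fs\n", std::chrono::duration<double>(t1 - t0).count());
  cudaFree(d_out);
  return 0;
}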

This answer (i.e. the one you already linked) has a similar description.
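
For comparison, the same two profiled runs could be launched with the MPS control daemon active, in which case the two kernels may genuinely overlap rather than time-slice. This is only a sketch: the pipe/log directories are illustrative, and starting the daemon may require extra permissions or setup (for example inside a container).

# Sketch: run the same experiment with MPS enabled (directory paths are illustrative)
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
nvidia-cuda-mps-control -d              # start the MPS control daemon

nsys profile -o rep1_mps -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -f true -x true ./t1034 &
nsys profile -o rep2_mps -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -f true -x true ./t1034
wait

echo quit | nvidia-cuda-mps-control     # shut the daemon down when done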

Robert Crovella
  • Thank you for the clarification; is there a way to view this switch between kernel A and kernel B? – Blaizz Oct 15 '21 at 14:26
  • not that I know of. – Robert Crovella Oct 15 '21 at 14:31
  • Just to make sure I understood: nowadays what used to be context switching can happen at the kernel level, even between different contexts. This switch from kernel A to kernel B is managed by the GPU scheduler (and can't be visualized). MPS is only relevant if kernels don't take 100% of the GPU, correct? It would allow them to be run concurrently rather than alternately? If I have a kernel that runs for 5s every 5s and another kernel that runs for .5s every 1s (from a different context), is there a way to prioritize the faster kernel? Or is it already happening? – Blaizz Oct 15 '21 at 14:42
  • It's still context switching today. What used to be a scheduler heuristic that would switch contexts at kernel boundaries has now changed to one that is time-sliced. Yes, MPS allows kernels (from different processes) to run concurrently (literally at the same instant in time) whereas without MPS there is the time-slicing from one kernel to another, as already discussed. MPS does not *guarantee* concurrency, it allows for the *possibility* of concurrency using this definition I've provided here. As you say, witnessing concurrency still has (resource) requirements for the kernels themselves. – Robert Crovella Oct 15 '21 at 14:47
  • These other questions about kernel priorities are not something I'm going to delve into in the comments. I intended to answer the question posed in the question. There are a lot of questions on these topics already, and your question here was largely already answered in the one you linked. You may wish to do some research. – Robert Crovella Oct 15 '21 at 14:50