Questions tagged [nsight-compute]

A GPU kernel and API call profiling tool for the CUDA environment, with both a command-line interface and a GUI; it replaces parts of the older nvprof and NVIDIA Visual Profiler tools and complements Nsight Systems.

34 questions
6
votes
1 answer

What are the "long" and "short" scoreboards w.r.t. MIO/L1TEX?

With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states. Two of the items in this taxonomy are: Short scoreboard - scoreboard dependency on an MIO queue operation. Long scoreboard -…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
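For the question above, here is a minimal sketch (hypothetical names, not the asker's code) of the two dependency kinds: the first use of a value returned by a global (L1TEX) load typically shows up as a Long Scoreboard wait, while the first use of a value coming back through the MIO path, such as a shared-memory load, shows up as a Short Scoreboard wait.

#include <cuda_runtime.h>

// Hypothetical kernel: where Nsight Compute's Warp State Statistics would
// typically attribute "Long Scoreboard" vs. "Short Scoreboard" stalls.
__global__ void scoreboard_demo(const float* __restrict__ gmem, float* out)
{
    __shared__ float smem[256];
    int tid = threadIdx.x;

    float g = gmem[tid];              // global (L1TEX) load
    smem[tid] = g;                    // first use of g: warp waits here on the long scoreboard
    __syncthreads();

    float s = smem[(tid + 1) % 256];  // shared-memory load goes through the MIO path
    out[tid] = g + s;                 // first use of s: warp waits here on the short scoreboard
}

int main()
{
    float *in, *out;
    cudaMalloc(&in, 256 * sizeof(float));
    cudaMalloc(&out, 256 * sizeof(float));
    scoreboard_demo<<<1, 256>>>(in, out);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}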
5
votes
1 answer

When does MIO Throttle stall happen?

According to this link https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html: Warp was stalled waiting for the MIO (memory input/output) instruction queue to be not full. This stall reason is high in cases of extreme utilization of the…
rm95
  • 167
  • 6
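As a hedged illustration of the "extreme utilization of the MIO pipeline" case the Profiling Guide describes: a hypothetical kernel that issues shared-memory instructions back-to-back with little math between them, so the MIO instruction queue tends to stay full.

#include <cuda_runtime.h>

// Hypothetical stress kernel: a dense stream of shared-memory loads keeps the
// MIO instruction queue busy, the situation in which "MIO Throttle" stalls
// are said to appear.
__global__ void mio_stress(float* out)
{
    __shared__ float smem[1024];
    int tid = threadIdx.x;
    smem[tid] = tid;
    __syncthreads();

    float acc = 0.f;
    #pragma unroll
    for (int i = 0; i < 64; ++i)       // back-to-back shared loads, little math in between
        acc += smem[(tid + i) & 1023];
    out[tid] = acc;
}

int main()
{
    float* out;
    cudaMalloc(&out, 1024 * sizeof(float));
    mio_stress<<<256, 1024>>>(out);    // oversubscribe so many warps compete for MIO
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}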
5
votes
1 answer

How does Nsight Compute determine/display shared memory metrics?

I'm learning about __shared__ memory in CUDA, and I'm confused about how Nsight Compute shows shared memory statistics. I'm going through this article (code available on Nvidia's github here, but copied below for reference). #include…
Daniel A. Thompson
  • 1,904
  • 1
  • 17
  • 26
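A minimal placeholder (not the article's code) showing one way to query the shared-memory instruction counters directly from the command line; the metric names are assumptions based on common usage and may differ between Nsight Compute versions.

#include <cuda_runtime.h>

// Placeholder kernel plus an ncu invocation one might use to inspect
// shared-memory traffic (metric names are assumptions):
//
//   ncu --metrics smsp__inst_executed_op_shared_ld.sum,smsp__inst_executed_op_shared_st.sum ./a.out
//
__global__ void smem_copy(const float* in, float* out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];      // one shared store per thread
    __syncthreads();
    out[i] = tile[threadIdx.x];     // one shared load per thread
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    smem_copy<<<n / 256, 256>>>(in, out);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}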
4
votes
0 answers

L2 Fabric cache hit rate of CUDA kernels on A100

I am profiling a read-only kernel in Nsight Compute on A100. The kernel is very simple and the complete code is as below. #include #include #include #include const int BLOCK = 1024; const int…
Shulai
  • 41
  • 2
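A read-only, grid-stride kernel sketch in the spirit of the question (not the asker's exact code), together with the metric commonly used for L2 sector hit rate; the metric name is an assumption and may vary with the Nsight Compute version.

#include <cuda_runtime.h>

// Read-only kernel sketch; a possible command to read the L2 hit rate
// (metric name assumed):
//
//   ncu --metrics lts__t_sector_hit_rate.pct ./a.out
//
const int BLOCK = 1024;

__global__ void read_only(const float* __restrict__ in, float* sink, int n)
{
    float acc = 0.f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        acc += in[i];                 // grid-stride read of the whole buffer
    if (acc == -1.f) *sink = acc;     // keep the loads from being optimized away
}

int main()
{
    const int n = 1 << 26;
    float *in, *sink;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&sink, sizeof(float));
    read_only<<<1024, BLOCK>>>(in, sink, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(sink);
    return 0;
}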
4
votes
1 answer

Nsight Compute says: "Profiling is not supported on this device" - why?

I have a machine with an NVIDA GTX 1050 Ti GPU (compute capability 6.1), and am trying to profile a kernel in a program I built with CUDA 11.4. My OS distribution is Devuan GNU/Linux 4 Chimaera (~= Debian 11 Bullseye). NSight Compute starts my…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
4
votes
1 answer

Interpreting compute workload analysis in Nsight Compute

Compute Workload Analysis displays the utilization of different compute pipelines. I know that in a modern GPU, integer and floating point pipelines are different hardware units and can execute in parallel. However, it is not very clear which…
heapoverflow
  • 264
  • 2
  • 12
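A hedged example of a kernel mixing FP32 and integer work; under Compute Workload Analysis one would expect the float math to be attributed to the FMA pipe and the integer math to the ALU pipe, though the exact pipeline breakdown is architecture-dependent.

#include <cuda_runtime.h>

// Hypothetical kernel issuing FP32 FMAs and integer ALU work in the same loop.
__global__ void mixed_pipes(float* fdst, int* idst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float f = i * 0.5f;
    int   v = i;
    for (int k = 0; k < 128; ++k) {
        f = f * 1.000001f + 0.5f;   // FP32 FMA work
        v = v * 3 + 1;              // integer ALU work
    }
    fdst[i] = f;
    idst[i] = v;
}

int main()
{
    const int n = 1 << 20;
    float* f; int* d;
    cudaMalloc(&f, n * sizeof(float));
    cudaMalloc(&d, n * sizeof(int));
    mixed_pipes<<<(n + 255) / 256, 256>>>(f, d, n);
    cudaDeviceSynchronize();
    cudaFree(f);
    cudaFree(d);
    return 0;
}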
4
votes
2 answers

CUDA: How to detect shared memory bank conflicts on devices with compute capability >= 7.2?

On devices with compute capability <= 7.2, I always use nvprof --events shared_st_bank_conflict, but when I run it on an RTX 2080 Ti with CUDA 10, it returns Warning: Skipping profiling on device 0 since profiling is not supported on devices with…
LiMou
  • 41
  • 3
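On Turing and newer, the usual replacement for the nvprof event is a pair of Nsight Compute metrics; the names below are to the best of my knowledge and may vary by version. The kernel deliberately provokes 32-way conflicts with a stride-32 access pattern.

#include <cuda_runtime.h>

// Deliberate shared-memory bank conflicts, with the metrics commonly
// suggested as the nvprof-event replacement (names assumed):
//
//   ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum ./a.out
//
__global__ void conflict_demo(float* out)
{
    __shared__ float smem[32 * 32];
    int tid = threadIdx.x;
    smem[tid * 32] = tid;          // stride-32 store: all 32 lanes hit the same bank
    __syncthreads();
    out[tid] = smem[tid * 32];     // stride-32 load: same conflict on the read
}

int main()
{
    float* out;
    cudaMalloc(&out, 32 * sizeof(float));
    conflict_demo<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}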
2
votes
1 answer

Unbalanced Memory Read & Write in CUDA

I noticed an unbalanced memory read and write amount when profiling the CUDA kernel below using ncu. __global__ void kernel(void* mem, int n) { int* ptr = reinterpret_cast<int*>(mem); for (int offset = (threadIdx.x + blockIdx.x *…
Alex Chen
  • 33
  • 2
2
votes
0 answers

With the NSight Compute profiler, can I check cache hit rates for a specific region of memory?

My GPU kernel reads data from different input buffers. I want to check whether I manage to get cache hits for the reads from one of these buffers. Is it possible to limit the counting of cache hit/miss metrics to a particular range of memory…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
2
votes
1 answer

How to profile a CUDA application with compute capability 7.x? Is the metric "dram_read_throughput" valid in Nsight Compute?

My setup environment: CUDA 10.2 Device: RTX 2080 OS: Ubuntu 16.04 When I try to use nvprof, I find that it doesn't support devices with compute capability 7.2 and higher. It is recommended that I should use Nsight Compute or Nsight Systems…
fishmingee
  • 21
  • 1
  • 2
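To the best of my knowledge, nvprof's dram_read_throughput maps to a dram__bytes_read-based throughput metric in Nsight Compute's nvprof metric-mapping table; treat the exact name below as an assumption. A trivial streaming kernel to profile:

#include <cuda_runtime.h>

// Streaming read + write to exercise DRAM; a possible Nsight Compute
// equivalent of nvprof's dram_read_throughput (name assumed):
//
//   ncu --metrics dram__bytes_read.sum.per_second ./a.out
//
__global__ void touch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    touch<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}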
1
vote
1 answer

CUDA math function register usage

I am trying to understand the significant register usage incurred when using a few of the built-in CUDA math ops like atan2() or division and how the register usage might be reduced/eliminated. I'm using the following program: #include…
Chris Uchytil
  • 140
  • 1
  • 11
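A hedged comparison sketch: the full-accuracy atan2f against a cheaper composition of intrinsics (atanf plus __fdividef), which gives up accuracy and quadrant handling. Register counts can be inspected with nvcc -Xptxas -v or in Nsight Compute's launch statistics; the sketch makes no claim about exact register numbers.

#include <cuda_runtime.h>

// Precise atan2f vs. a cheaper intrinsic-based path, for comparing register usage.
__global__ void precise_atan2(const float* y, const float* x, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = atan2f(y[i], x[i]);              // full-accuracy path
}

__global__ void fast_ratio(const float* y, const float* x, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = atanf(__fdividef(y[i], x[i]));   // cheaper; no quadrant handling
}

int main()
{
    const int n = 1 << 20;
    float *x, *y, *out;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    precise_atan2<<<(n + 255) / 256, 256>>>(y, x, out, n);
    fast_ratio<<<(n + 255) / 256, 256>>>(y, x, out, n);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}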
1
vote
1 answer

Roofline Model with CUDA Manual vs. Nsight Compute

I have a very simple vector addition kernel written for CUDA. I want to calculate the arithmetic intensity as well as GFLOP/s for this Kernel. The values I calculate differ visibly from the values obtained by Nsight Compute's Roofline Analysis…
Cherry Toska
  • 131
  • 8
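As a sanity check, the hand calculation for a float vector add: each element costs 1 FLOP and moves 2 loads plus 1 store of 4 bytes, so the naive arithmetic intensity is 1/12 ≈ 0.083 FLOP/byte, and GFLOP/s is n divided by the kernel time. Nsight Compute's roofline counts bytes at a specific memory level (e.g. DRAM), so its figure can legitimately differ. A minimal kernel with the arithmetic in comments:

#include <cuda_runtime.h>

// Per element: 1 FLOP, 2 * 4 B read + 1 * 4 B written = 12 B of traffic,
// so arithmetic intensity ~ 1/12 FLOP/byte for the whole kernel.
__global__ void vec_add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 24;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}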
1
vote
2 answers

ncu-ui won't run: Could not load the Qt platform plugin "xcb" in "" even though it was found

I'm trying to run the ncu-ui profiler GUI on a CentOS 7 Linux system (using ncu-ui 2022.1), both as root and as a regular user. I'm getting the error: qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found. This…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
1
vote
1 answer

Shared memory loads not registered when using Tensor Cores

I'm trying to multiply blocks of size 8x8 using Tensor Cores on a GPU designed with the Turing architecture. For that I'm using the WMMA API and fragments of size 16x16. My assumption was that shared memory bandwidth would be wasted since most data…
rm95
  • 167
  • 6
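A minimal WMMA sketch (standard nvcuda::wmma API, compiled for sm_75 or newer) of the pattern the question is about: fragments loaded from shared memory and fed to the Tensor Cores. It is illustrative only, not the asker's 8x8-in-16x16 packing code.

#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp stages 16x16 operands in shared memory, then loads WMMA fragments
// from it; these are the shared-memory loads the profiler should attribute.
__global__ void wmma_from_smem(const half* a, const half* b, float* c)
{
    __shared__ half sa[16 * 16];
    __shared__ half sb[16 * 16];

    for (int i = threadIdx.x; i < 16 * 16; i += 32) {
        sa[i] = a[i];
        sb[i] = b[i];
    }
    __syncthreads();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, sa, 16);   // shared-memory fragment loads
    wmma::load_matrix_sync(fb, sb, 16);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

int main()
{
    half *a, *b; float *c;
    cudaMalloc(&a, 16 * 16 * sizeof(half));
    cudaMalloc(&b, 16 * 16 * sizeof(half));
    cudaMalloc(&c, 16 * 16 * sizeof(float));
    wmma_from_smem<<<1, 32>>>(a, b, c);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}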
1
vote
2 answers

What does NSight Compute show for a stall reason that isn't "supported"?

The CUDA Profiling Guide lists various reasons for sampled warp stalls, e.g. Allocation, Barrier, LG Throttle etc. And - the NSight Compute profiler shows the distribution of these as part of the profiling results. The thing is, some of the stall…
einpoklum
  • 118,144
  • 57
  • 340
  • 684