How does Nsight Compute determine/display shared memory metrics?

Question

I'm learning about __shared__ memory in CUDA, and I'm confused about how Nsight Compute shows shared memory statistics.

I'm going through this article (code available on Nvidia's github here, but copied below for reference).

#include <stdio.h>

__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

__global__ void dynamicReverse(int *d, int n)
{
  extern __shared__ int s[];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

int main(void)
{
  const int n = 64;
  int a[n], r[n], d[n];

  for (int i = 0; i < n; i++) {
    a[i] = i;
    r[i] = n-i-1;
    d[i] = 0;
  }

  int *d_d;
  cudaMalloc(&d_d, n * sizeof(int)); 

  // run version with static shared memory
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  staticReverse<<<1,n>>>(d_d, n);
  cudaMemcpy(d, d_d, n*sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) 
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);

  // run dynamic shared memory version
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
  cudaMemcpy(d, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) 
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);
}

When I run Nsight Compute, I see the following chart for the staticReverse kernel (the dynamicReverse kernel is almost identical):

Question 1: The chart shows 1 request to and 1 request from shared memory, but why does it also show 0 shared memory instructions? Does a request not count as an instruction? From this chart's perspective, what counts as a shared memory instruction?

Next, in the source view, Nsight Compute shows line-by-line counts for various metrics:

Question 2: Why does "Memory L1 Transactions Shared" show 0 for lines 8 and 10? I was expecting to see:

Line 8: an equal number of [load transactions from global memory] and [store transactions to shared memory]
Line 10: an equal number of [load transactions from shared memory] and [store transactions to global memory]

Question 3: Why are there 8 memory transactions each for lines 8 and 10?

My system:

Ubuntu 18.04 LTS
GeForce 1070 (Pascal)
CUDA Version: 10.2
Driver Version: 440.64.00

I was going to study the PTX files and link them to the reported Nsight performance metrics for the two kernels here in an attempt to answer your question but realized that I actually cannot reproduce the same memory chart as yours. In my case, I see the same number of shared memory _instructions_ as those of the global memory for both kernels. I have CUDA 10.2 and compiled the code against CC 6.1 but on Windows 10. — If_You_Say_So, May 31 '20 at 22:51

score 2 · Answer 1 · answered Jun 05 '20 at 07:02

It would be good if you could check (and show here) the low-level SASS view of the Source page, along with the high-level CUDA-C view. The source metrics are collected per SASS (assembly) instruction, and then aggregated up in the CUDA-C view. Checking the actual assembly can be informative on the type of instructions generated by the compiler, and could better explain the data you are seeing.

Does a request not count as an instruction? From this chart's perspective, what counts as a shared memory instruction?

Requests and instructions are not the same thing. Instructions are the actual SASS assembly instructions being executed. Requests are generated by the HW as a result of executed instructions, and the number of requests may vary depending on how well the code behaves.

I'll see if I can post the SASS later today. Re: instructions vs. requests, I get how 1 instruction could generate 1 or more requests, but it's not clear to me how there could be 1 request from 0 instructions. — Daniel A. Thompson, Jun 05 '20 at 12:04

How does Nsight Compute determine/display shared memory metrics?

1 Answers1