
I am using nvprof to measure achieved occupancy, and it reports:

Achieved Occupancy 0.344031 0.344031 0.344031

but the occupancy calculator gives me 75%.

The calculator results are:

Active Threads per Multiprocessor   1536
Active Warps per Multiprocessor 48
Active Thread Blocks per Multiprocessor 6
Occupancy of each Multiprocessor    75%

I am using 33 registers, 144 bytes of shared memory, and 256 threads per block on a device of compute capability 3.5.

EDIT:

Also, there is something I want to clarify. The profiler user's guide at http://docs.nvidia.com/cuda/profiler-users-guide/#axzz30pb9tBTN gives this description for

gld_efficiency

Ratio of requested global memory load throughput to required global memory load throughput expressed as percentage

So if this is 0%, does it mean that I have no global memory transfers in the kernel?


George

1 Answer


You need to understand that the occupancy calculator is providing the maximum theoretical occupancy that a given kernel can achieve, based only on the resource requirements of that kernel. It does not (and cannot) say anything about how much of that theoretical occupancy the code is capable of achieving.
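As a rough illustration of what the calculator does with the numbers in the question, here is a minimal Python sketch. The per-SM limits and allocation granularities for compute capability 3.5 are assumptions written down from the occupancy calculator spreadsheet, not values read from the device, and the function name `theoretical_occupancy` is hypothetical. Check the constants against the spreadsheet shipped with your CUDA version before relying on them.

```python
# Assumed per-SM limits for compute capability 3.5 (not queried from hardware).
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64
MAX_BLOCKS_PER_SM = 16
REGISTERS_PER_SM = 65536
REG_ALLOC_UNIT = 256          # assumed: registers are allocated per warp in units of 256
WARP_ALLOC_GRANULARITY = 4    # assumed: warps are allocated in groups of 4
SHARED_MEM_PER_SM = 49152     # assumed: default 48 KB shared memory configuration
SHARED_ALLOC_UNIT = 256       # assumed: shared memory is allocated in 256-byte units


def round_up(value, unit):
    """Round value up to the next multiple of unit."""
    return -(-value // unit) * unit


def theoretical_occupancy(threads_per_block, regs_per_thread, shared_bytes):
    """Return (blocks per SM, warps per SM, occupancy) from resource limits only."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)

    # Limit 1: warp and block slots on the SM.
    blocks_by_warps = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)

    # Limit 2: registers, allocated per warp and rounded up to the allocation unit.
    regs_per_warp = round_up(regs_per_thread * WARP_SIZE, REG_ALLOC_UNIT)
    warps_by_regs = REGISTERS_PER_SM // regs_per_warp
    warps_by_regs -= warps_by_regs % WARP_ALLOC_GRANULARITY
    blocks_by_regs = warps_by_regs // warps_per_block

    # Limit 3: shared memory per block, rounded up to the allocation unit.
    if shared_bytes > 0:
        blocks_by_smem = SHARED_MEM_PER_SM // round_up(shared_bytes, SHARED_ALLOC_UNIT)
    else:
        blocks_by_smem = MAX_BLOCKS_PER_SM

    blocks = min(blocks_by_warps, blocks_by_regs, blocks_by_smem)
    active_warps = blocks * warps_per_block
    return blocks, active_warps, active_warps / MAX_WARPS_PER_SM


# Numbers from the question: 256 threads/block, 33 registers, 144 bytes shared memory.
blocks, warps, occupancy = theoretical_occupancy(256, 33, 144)
print(blocks, warps, occupancy)
```

Under these assumptions the register limit is the binding one, which matches the 6 blocks, 48 warps and 75% shown in the question. Nothing in this calculation depends on how the kernel actually behaves at run time.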

The profiling tools, on the other hand, deduce actual occupancy from measured profile counters. According to this document, the achieved occupancy number you are asking about is calculated as

(active_warps / active_cycles) / MAX_WARPS_PER_SM

i.e. it samples the number of active warps on one or more SMs during a kernel run and calculates the actual occupancy from that.
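To make the formula concrete, here is a small Python sketch that inverts it for the figure reported in the question. It assumes `MAX_WARPS_PER_SM` is 64 for compute capability 3.5. The function name and the raw counter values in the second call are made up purely for illustration.

```python
MAX_WARPS_PER_SM = 64  # assumed limit for compute capability 3.5


def achieved_occupancy(active_warps, active_cycles, max_warps=MAX_WARPS_PER_SM):
    """(active_warps / active_cycles) / MAX_WARPS_PER_SM, as in the formula above."""
    return (active_warps / active_cycles) / max_warps


# Inverting the formula: the reported 0.344031 corresponds to this many
# warps resident on an SM on average, per active cycle.
reported = 0.344031
average_resident_warps = reported * MAX_WARPS_PER_SM
print(round(average_resident_warps, 2))

# Hypothetical counter values, chosen only to show the forward calculation.
example = achieved_occupancy(active_warps=2_200_000, active_cycles=100_000)
print(example)
```

So the kernel averaged about 22 resident warps per SM, against the 48 that its resource usage would allow.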

There can be a lot of reasons why a kernel doesn't achieve its theoretical occupancy, and (before you ask), no I can't tell you why your kernel doesn't reach theoretical occupancy. But the Visual Profiler can. If it is important to you, I suggest you look at the automated performance analysis features available in the CUDA 5/6 visual profiler as a way of better understanding the performance of your code.

It is also worth pointing out that occupancy should be treated as only a rough metric of potential code performance, and high theoretical occupancy doesn't always translate into high performance. Instruction level parallelism and latency minimisation strategies can also be very effective at reaching high levels of performance, even at low occupancy. There is a large body of work on this, most of it stemming from Vasily Volkov's seminal GTC 2010 paper.

talonmies
  • OK, thanks. So the achieved occupancy reported by nvprof is the one I should look at. I have Volkov's paper in mind, thanks. I just also want to check that occupancy is not too low. Finally, regarding gld_efficiency: is what I said correct, or is it the opposite? – George May 05 '14 at 12:04
  • Hello, I just ran the Nsight profiler and saw that for every kernel it reports a different number of registers than --ptxas-options=-v gives me. So should I take the number from the Nsight profiler as the register count? Also, for gld_efficiency, if it is 0%, does that mean I have no global memory transfers in the kernel? Thanks! – George May 06 '14 at 08:18