I am using nvprof to measure achieved occupancy, and it reports:
Achieved Occupancy 0.344031 0.344031 0.344031
However, the CUDA Occupancy Calculator gives me 75%.
Its results are:
Active Threads per Multiprocessor 1536
Active Warps per Multiprocessor 48
Active Thread Blocks per Multiprocessor 6
Occupancy of each Multiprocessor 75%
I am using 33 registers per thread, 144 bytes of shared memory per block, and 256 threads per block on a device of compute capability 3.5.
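For reference, here is a minimal sketch of how I understand the calculator arrives at 75% from these inputs. The per-multiprocessor limits and allocation granularities are assumptions I took from the Occupancy Calculator spreadsheet for compute capability 3.5 (including the 48 KB shared memory configuration), and the function name `theoretical_occupancy` is my own. It reproduces only the theoretical figure, not anything nvprof measures at run time.

```python
import math

# Assumed per-SM limits for compute capability 3.5, taken from the CUDA
# Occupancy Calculator spreadsheet (they are not measured values).
MAX_WARPS_PER_SM = 64
MAX_BLOCKS_PER_SM = 16
REGISTERS_PER_SM = 65536
SHARED_MEM_PER_SM = 49152       # bytes, assuming the 48 KB configuration
REG_ALLOC_UNIT = 256            # registers are allocated per warp in units of 256
WARP_ALLOC_GRANULARITY = 4      # warps are allocated in groups of 4
SHARED_ALLOC_UNIT = 256         # bytes
WARP_SIZE = 32


def round_up(value, unit):
    """Round value up to the next multiple of unit."""
    return math.ceil(value / unit) * unit


def theoretical_occupancy(threads_per_block, regs_per_thread, shared_per_block):
    """Return (active blocks, active warps, occupancy) per multiprocessor."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)

    # Limit imposed by the available warp and block slots.
    by_warps = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)

    # Limit imposed by the register file.
    regs_per_warp = round_up(regs_per_thread * WARP_SIZE, REG_ALLOC_UNIT)
    warps_by_regs = REGISTERS_PER_SM // regs_per_warp
    warps_by_regs -= warps_by_regs % WARP_ALLOC_GRANULARITY
    by_regs = warps_by_regs // warps_per_block

    # Limit imposed by shared memory.
    if shared_per_block > 0:
        by_shared = SHARED_MEM_PER_SM // round_up(shared_per_block, SHARED_ALLOC_UNIT)
    else:
        by_shared = MAX_BLOCKS_PER_SM

    blocks = min(by_warps, by_regs, by_shared)
    warps = blocks * warps_per_block
    return blocks, warps, warps / MAX_WARPS_PER_SM


blocks, warps, occupancy = theoretical_occupancy(256, 33, 144)
print(blocks, warps, occupancy)  # prints: 6 48 0.75
```

Under these assumptions the register file is the limiting resource (6 blocks of 8 warps each), which matches the calculator output listed above.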
EDIT:
Also, there is something I want to clarify. The profiler user's guide at http://docs.nvidia.com/cuda/profiler-users-guide/#axzz30pb9tBTN states for
gld_efficiency
Ratio of requested global memory load throughput to required global memory load throughput expressed as percentage
So, if this is 0%, does it mean that I have no global memory loads in the kernel?