I want to get sampling data per instruction. It turns out that such a tool is rather difficult to find.
The image below is a good example from NVIDIA Nsight Compute, a profiler for GPU programs.
On the right-hand side, you can clearly see each assembly instruction and its corresponding counters, such as how many times it was executed and the reasons why it could not be dispatched, i.e. Sampling Data (Not Issued).
Correlation between the source code (left part) and the assembly code is NOT necessary for my request.
For details on Nsight Compute, see: https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#profiler-report-source-page
I know about hot-method analysis and other profiling techniques. However, I need very detailed profiling results for a piece of assembly code.