Most of the papers show the flops/Gflops and achieved bandwidth for their CUDA kernels. I have also read answers on stackoverflow for the following questions:
How to evaluate CUDA performance?
How Do You Profile & Optimize CUDA Kernels?
How to calculate Gflops of a kernel
Counting FLOPS/GFLOPS in program - CUDA
How to calculate the achieved bandwidth of a CUDA kernel
Most of the things seem ok, but still does not make me feel comfortable in calculating these things. Can anyone write a simple CUDA kernel? Then give the output of deviceQuery. Then compute step by step the flops/Gflops and achieved bandwidth for this kernel. Then show the Visual Profiler results for this kernel. I.e. show the results in detail with all the information obtained step by step for this simple CUDA kernel. That would be really helpful for most of us. Thanks!