Update: In a slightly unexpected turn of events, the question this was marked as a duplicate of gained an excellent answer fulfilling my requirements after I posted this: How to calculate Gflops of a kernel. To summarize, both Nsight for Visual Studio and nvvp can report FLOPS if you ask them correctly.
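For reference, the command-line route looks roughly like this. This is a minimal sketch, assuming a compute 3.x device; `./my_app` is a stand-in for the actual binary:

```
# flop_count_sp / flop_count_dp count executed single- and
# double-precision floating-point operations per kernel
nvprof --metrics flop_count_sp,flop_count_dp ./my_app
```

GFLOP/s then follows from dividing the reported operation count by the kernel duration in nvprof's summary output.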
I'm leaving my original question for posterity, but it is now redundant.
There have been a few questions (which I have read) about how one profiles a CUDA program, but I have not been able to find anything definitive on this. I have a piece of CUDA code and have written about its performance at varying levels of optimization. It has been requested that I give absolute GFLOP/s numbers as well, and I'm not entirely sure whether that's possible, or, if it is, how to do it.
Relevant properties
- GPU is a GTX 780Ti (Compute 3.5)
- GPU is attached to a machine running CentOS 6.3 (this is non-negotiable)
- CUDA toolkit is version 6.0: `nvprof`, `nvvp`, and `nsight` are available
- Algorithm is data-dependent -- run length is technically nondeterministic
- Run length is long enough that nondeterminism averages out satisfactorily
Does there exist a way to profile the actual floating-point operation count of the kernels in this piece of software, or do I have to tell the reviewer that it's not possible in this situation?
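For completeness, here is how a measured operation count would turn into a GFLOP/s figure. This is a minimal sketch, not my actual code: `myKernel`, `N`, and the launch configuration are hypothetical placeholders, and for a data-dependent kernel the analytic count in the comment would be replaced by the profiler-reported one.

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: 2 FLOPs (one multiply, one add) per element.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int N = 1 << 24;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    // Time the kernel with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Analytic FLOP count for this toy kernel; for a data-dependent
    // kernel, substitute the count reported by nvprof/nvvp instead.
    double flops = 2.0 * (double)N;
    printf("%.2f GFLOP/s\n", flops / (ms * 1e-3) / 1e9);

    cudaFree(d_data);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```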