Update: In a slightly unexpected turn of events, the question this was marked as a duplicate of gained an excellent answer fulfilling my requirements after I posted this: How to calculate Gflops of a kernel. To summarize, both Nsight for Visual Studio and nvvp can report FLOPS if you ask them correctly.
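For reference, the command-line route looks roughly like this. This is a minimal sketch, assuming a compute 3.x device; `./my_app` is a stand-in for the actual binary:

```
# flop_count_sp / flop_count_dp count executed single- and
# double-precision floating-point operations per kernel
nvprof --metrics flop_count_sp,flop_count_dp ./my_app
```

GFLOP/s then follows from dividing the reported operation count by the kernel duration in nvprof's summary output.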
I'm leaving my original question for posterity, but it is now redundant.
There have been a few questions (which I have read) about how one profiles a CUDA program, but I have not been able to find anything definitive on this. I have a piece of CUDA code and have written about its performance at varying levels of optimization. It has been requested that I give absolute GFLOP/s numbers as well, and I'm not entirely sure whether that's possible, or, if it is, how to do it.
Relevant properties
- GPU is a GTX 780Ti (Compute 3.5)
- GPU is attached to a machine running CentOS 6.3 (this is non-negotiable)
- CUDA toolkit is version 6.0: `nvprof`, `nvvp`, and `nsight` are available
- Algorithm is data-dependent -- run length is technically nondeterministic
- Run length is long enough that nondeterminism averages out satisfactorily
Does there exist a way to profile the actual floating-point operation count of the kernels in this piece of software, or do I have to tell the reviewer that it's not possible in this situation?
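For completeness, here is how a measured operation count would turn into a GFLOP/s figure. This is a minimal sketch, not my actual code: `myKernel`, `N`, and the launch configuration are hypothetical placeholders, and for a data-dependent kernel the analytic count in the comment would be replaced by the profiler-reported one.

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: 2 FLOPs (one multiply, one add) per element.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int N = 1 << 24;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    // Time the kernel with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Analytic FLOP count for this toy kernel; for a data-dependent
    // kernel, substitute the count reported by nvprof/nvvp instead.
    double flops = 2.0 * (double)N;
    printf("%.2f GFLOP/s\n", flops / (ms * 1e-3) / 1e9);

    cudaFree(d_data);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```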