
I see that nvprof can profile the number of floating-point operations executed by a kernel (using the metrics shown below). The documentation (here http://docs.nvidia.com/cuda... ) says that flop_count_sp is the "Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special). Each multiply-accumulate operation contributes 2 to the count."

However, when I run the profiler, the value of flop_count_sp (which is supposed to equal flop_count_sp_add + flop_count_sp_mul + flop_count_sp_special + 2 * flop_count_sp_fma) does not include flop_count_sp_special in the summation.

Could you suggest what I am supposed to use? Should I add this value to flop_count_sp myself, or should I assume the formula simply does not include flop_count_sp_special?

Also, could you please tell me what these special operations are?

I'm using the following command line:

nvprof --metrics flops_sp --metrics flops_sp_add --metrics flops_sp_mul --metrics flops_sp_fma --metrics flops_sp_special myKernel args

Where myKernel is the name of my CUDA kernel which has some input arguments given by args.
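
For reference, the metric names nvprof actually accepts are flop_count_sp* rather than flops_sp* (the output below uses the flop_count_sp* names). An invocation matching that output would look something like this (a sketch: myKernel and args stand in for the real binary and its arguments):

```shell
nvprof --metrics flop_count_sp,flop_count_sp_add,flop_count_sp_mul,flop_count_sp_fma,flop_count_sp_special ./myKernel args
```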

For instance, a section of my nvprof output is shown below:

 ==20549== Profiling result:
 ==20549== Metric result:
 Invocations                               Metric Name                        Metric Description         Min         Max         Avg
 Device "Tesla K40c (0)"
    Kernel: mykernel(float*, int, int, float*, int, float*, int*)
           2                             flop_count_sp  Floating Point Operations(Single Precisi       70888       70888       70888
           2                         flop_count_sp_add  Floating Point Operations(Single Precisi       14465       14465       14465
           2                         flop_count_sp_mul  Floating Point Operation(Single Precisio       14465       14465       14465
           2                         flop_count_sp_fma  Floating Point Operations(Single Precisi       20979       20979       20979
           2                     flop_count_sp_special  Floating Point Operations(Single Precisi       87637       87637       87637
Amit
    Can you provide the values for the individual events/metrics? Your command line does not seem valid to me, there's no such metric as `flops_sp*`. – Tom Jun 06 '17 at 12:10

1 Answer


The "special" operations are listed in the arithmetic throughput table in the Programming Guide; they are: reciprocal, reciprocal square root, log, exp, sin, cos. Note that these are less precise (but faster) than the default versions, and you have to opt in by using the intrinsics or a compiler flag (-use_fast_math).
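
As an illustration (a made-up sketch, not taken from the question's kernel), the operations below are handled by the special function units and would be counted in flop_count_sp_special; the __-prefixed intrinsics always map there, while the plain library calls (expf, sinf, plain division, ...) only do so when built with -use_fast_math:

```cuda
// Sketch: single-precision operations served by the special function
// units (SFUs). specialOps is a hypothetical name; the intrinsics are
// documented CUDA device intrinsics.
__global__ void specialOps(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = __expf(x)               // fast exponential
               + __logf(x)               // fast natural logarithm
               + __sinf(x) + __cosf(x)   // fast sine / cosine
               + rsqrtf(x)               // reciprocal square root
               + __fdividef(1.0f, x);    // fast division (via reciprocal)
    }
}
```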

Despite what the documentation says, the special operations are not included in the flop_count_sp total. That is a bug in the current version (8.0); I have filed a bug report, so it should be fixed in a future release (meaning this paragraph will be out of date at some point).

Tom
  • Thank you so much for pointing out the special operations and for reporting the bug. But I am wondering: since my kernel does not use any of these special operations, what operations could contribute the value "87637" to "flop_count_sp_special"? – Amit Jun 07 '17 at 12:05
  • I have shared my code here ( https://bitbucket.org/rajgurung777/blogqueryproject ). You can find the kernel implementation in the file simplex.cu – Amit Jun 07 '17 at 12:15
  • Can you link to the bug page? – einpoklum Jun 07 '17 at 21:45
  • @einpoklum sorry that's not possible. – Tom Jun 07 '17 at 21:57
  • Thank you Thomas for the prompt help, and sorry about the code: at present it has dependencies (the GLPK and Boost libraries) needed to compile. However, I now see how the value "87637" in flop_count_sp_special arises: in my kernel I use the expression x/y in a number of places, and x/y also falls under the special functions. I did not notice that earlier. – Amit Jun 08 '17 at 04:17
  • @Shadow _division_ is not a special operation, but _reciprocal_ is. In this context, "special" means that the operation is performed by the special function units as opposed to the CUDA cores (c.f. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-3-0 for example, that's for Kepler which is what the OP is using). The special operations should really be included in the total, especially since the breakdown (add, mul, fma, special) is available in any case. – Tom Jun 08 '17 at 11:00
  • @Tom You're right. Sorry. My brain must have been afk :/ – BlameTheBits Jun 08 '17 at 14:41