
I have a fairly large CUDA application and need to calculate its attained GFLOP/s. I'm looking for an easy, ideally generic, way to count the number of floating-point operations it performs.

Is it possible to count floating-point operations from the generated PTX code (as shown below), using a predefined list of floating-point instructions in the assembly language? Based on the code, can the counting be made generic? For example, does `add.s32 %r58, %r8, -2;` count as one floating-point operation?

EXAMPLE:

    BB3_2:
    .loc 2 108 1
    mov.u32         %r8, %r79;
    setp.ge.s32     %p1, %r78, %r16;
    setp.lt.s32     %p2, %r78, 0;
    or.pred         %p3, %p2, %p1;
    @%p3 bra        BB3_5;

    add.s32         %r58, %r8, -2;
    setp.lt.s32     %p4, %r58, 0;
    setp.ge.s32     %p5, %r58, %r15;
    or.pred         %p6, %p4, %p5;
    @%p6 bra        BB3_5;

    .loc 2 112 1
    ld.global.u8    %rc1, [%rd17];
    cvt.rn.f32.u8   %f11, %rc1;
    mul.wide.u32    %rd12, %r80, 4;
    add.s64         %rd13, %rd7, %rd12;
    ld.local.f32    %f12, [%rd13];
    fma.rn.f32      %f14, %f11, %f12, %f14;
    .loc 2 113 1
    add.f32         %f15, %f15, %f12;

Or is there a far simpler way of counting FLOPs, making this a waste of time?

  • `add.s32` is a signed, 32-bit integer addition, so it doesn't count as a FLOP. The code you posted contains one floating-point multiply-add instruction (`fma.rn.f32`), which should be counted as 2 FLOP. But note that PTX is only an intermediate representation of the code; it isn't what the GPU runs, so counting PTX instructions is probably futile. – talonmies Feb 11 '13 at 13:11
  • Thanks. Looks like it's back to the original code for me. – Francis Saa-Dittoh Feb 11 '13 at 13:32
  • I assumed it would work because I read an answer [here](http://stackoverflow.com/questions/5330717/counting-flops-for-a-code) that said "_Try to either take intermediate assembly code or decompile exe. Then count all floating point operations (in x86 assembly code they start with F prefix like FSIN)._" – Francis Saa-Dittoh Feb 11 '13 at 13:42
  • If you understand the code flow and the architecture you are running on, it should be possible to dump the SASS assembly code (`cuobjdump --dump-sass myfile`). The SASS assembly code should be what the device actually executes. Unfortunately, NVIDIA doesn't provide a SASS reference that I am aware of, but if you familiarize yourself with the [PTX reference](http://docs.nvidia.com/cuda/pdf/ptx_isa_3.1.pdf), you should be able to generally understand the SASS. – Robert Crovella Feb 11 '13 at 14:02
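
To make the counting rule from the comments concrete, here is a minimal Python sketch of a static PTX counter. The function name `count_static_flops` and the `FLOP_WEIGHTS` table are hypothetical and deliberately incomplete (no `div`, `sqrt`, `rcp`, and so on). It only tallies instructions as they appear in the text, so it ignores loops, branches and thread counts, and, as noted above, PTX is not what the device ultimately executes.

```python
import re

# Assumed FLOP weight per PTX opcode (illustrative, not an official table).
# A fused multiply-add is counted as 2 FLOP, as suggested in the comments.
FLOP_WEIGHTS = {"add": 1, "sub": 1, "mul": 1, "fma": 2, "mad": 2}
FLOAT_TYPES = ("f16", "f32", "f64")

# Optional predicate guard (e.g. "@%p3"), then the opcode and its dotted modifiers.
INSTR = re.compile(r"^\s*(?:@!?%\w+\s+)?([a-z]+)((?:\.\w+)*)")


def count_static_flops(ptx_text):
    """Sum FLOP weights over floating-point arithmetic instructions in PTX text."""
    total = 0
    for line in ptx_text.splitlines():
        match = INSTR.match(line)
        if not match:
            continue  # labels, directives such as .loc, blank lines
        opcode = match.group(1)
        modifiers = match.group(2).split(".")[1:]
        # For arithmetic instructions the operand type is the last modifier,
        # e.g. fma.rn.f32 -> f32, add.s32 -> s32.
        if opcode in FLOP_WEIGHTS and modifiers and modifiers[-1] in FLOAT_TYPES:
            total += FLOP_WEIGHTS[opcode]
    return total


SAMPLE = """
    add.s32         %r58, %r8, -2;
    @%p6 bra        BB3_5;
    .loc 2 112 1
    cvt.rn.f32.u8   %f11, %rc1;
    mul.wide.u32    %rd12, %r80, 4;
    ld.local.f32    %f12, [%rd13];
    fma.rn.f32      %f14, %f11, %f12, %f14;
    add.f32         %f15, %f15, %f12;
"""

print(count_static_flops(SAMPLE))  # 3: one fma (2 FLOP) plus one add.f32 (1 FLOP)
```

A static count like this would still have to be multiplied by how often each instruction actually runs, which is why a profiler-based count is usually the more reliable route.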

1 Answer


The easiest way to count floating-point operations would be to have the CUDA profiler do it for you. By selecting the Achieved FLOPS experiment, you can get charts like this:

[Figure: FLOPS experiment]

The Floating Point Operations chart displays a count of each type of floating point operation executed by your kernel.
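
Once you have per-type counts, turning them into attained GFLOP/s is simple arithmetic. The Python sketch below is only an illustration: the function names, the dictionary keys and the numbers are made up, and it assumes the counts are instruction counts in which each fused multiply-add should be weighted as 2 FLOP. Check how your profiler reports FMAs before relying on that weighting.

```python
def total_flops(counts, fma_weight=2):
    """Combine per-type operation counts into one FLOP total.

    `counts` is a hypothetical dict such as {"add": ..., "mul": ..., "fma": ...};
    every type counts once except "fma", which is weighted by `fma_weight`.
    """
    return sum(n * (fma_weight if op == "fma" else 1) for op, n in counts.items())


def attained_gflops(flop_count, elapsed_seconds):
    """Attained throughput in GFLOP/s: operations per second divided by 1e9."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return flop_count / elapsed_seconds / 1e9


# Made-up example numbers, not measurements.
counts = {"add": 1_000_000_000, "mul": 1_000_000_000, "fma": 1_000_000_000}
flops = total_flops(counts)          # 1e9 + 1e9 + 2 * 1e9 = 4e9 FLOP
print(attained_gflops(flops, 2.0))   # 2.0 GFLOP/s for a 2-second kernel
```

The elapsed time should be the kernel execution time (for example, measured with CUDA events), not the wall-clock time of the whole application.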

Greg Smith
Roger Dahl
  • The problem I have is, I am running the code on a cluster through SSH and I can't install the profiler on the cluster. Is there a way to export the information and view on the profiler? – Francis Saa-Dittoh Feb 11 '13 at 16:52
  • @masoftheund: I don't know, but go ahead and add a new question for that so that everyone sees it. – Roger Dahl Feb 11 '13 at 20:18