
We had to implement an ASM program for multiplying sparse matrices in the coordinate scheme format (COOS) as well as in the compressed row format (CSR). Now that we have implemented all these algorithms, we want to know how much more performant they are compared to the usual matrix multiplication. We have already implemented code to measure the running time of all these algorithms, but now we have decided that we also want to know how many floating-point operations per second (FLOPS) we can perform. Any suggestions on how to measure/count this?

Here is some background information on the system used:

processor   : 0
model name  : ARMv7 Processor rev 2 (v7l)
Features    : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part    : 0xc08
CPU revision    : 2

Our first idea was to implement a kind of FPO counter which we increment after each floating-point operation (arithmetic operations as well as comparison and move operations), but that would mean we have to insert increment operations all over our code, which also slows down the application ... Does anyone know whether there is some kind of hardware counter which counts the number of floating-point operations, or whether there exists some kind of performance tool which can monitor our program and measure the number of FPOs? Any suggestions or pointers would be appreciated.
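For reference, the counter idea can be kept fairly cheap if the increments are batched per loop iteration rather than per instruction. This is only a sketch in C (the names `flop_count` and `dot` are made up for illustration; the real code is in assembly), showing a kernel that counts its own floating-point operations:

```c
#include <stdint.h>

/* Hypothetical global FLOP counter, bumped alongside the
   floating-point work we want to account for. */
static uint64_t flop_count = 0;

/* Example kernel: dot product with manual op counting.
   Each iteration performs one multiply and one add, so the
   counter is bumped by 2 once per iteration instead of
   once per operation. */
static double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
        flop_count += 2; /* 1 mul + 1 add */
    }
    return sum;
}
```

FLOPS is then `flop_count` divided by the measured wall time; for timing runs, the increments can be compiled out behind an `#ifdef` so they don't perturb the measurement.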

Here is the evaluation of the FLOPS for a matrix multiplication using the counting approach. We first measured the running time, then inserted counters for each instruction we were interested in, and after that we calculated the number of floating-point operations per second. [Chart: floating-point operations per second for matrix multiplication]

tzwickl
  • Why is FLOPS a relevant comparative metric here? The performance of your matrix multiplication can be measured as the number of matrix multiplications per second. – Oliver Charlesworth Jan 25 '15 at 23:44
  • @OliverCharlesworth But if I say, for instance, that I can perform 10 matrix multiplications per second with this algorithm, this doesn't mean anything, because it depends on the processor used ... If I run it on a faster processor it suddenly becomes 5 or so ... But if I know how many FLOPS this algorithm needs, then it is the same on each processor ... Because usually you know how many FLOPS a processor can perform ... – tzwickl Jan 25 '15 at 23:49
  • What is the same on each processor? – Oliver Charlesworth Jan 25 '15 at 23:54
  • 2
    @OliverCharlesworth If you know how many FLOPS your algorithm can perform than you can generalise this and use it as comparison for other applications. That was what I meant with the same on each processor ... Look at this for example: "My CPU is capable of X GFLOPS, and I'm only actually achieving a throughput of, say, 20% of that" is very valuable information in high-performance software. It means that something other than the floating point ops is holding you back, and preventing the FP units from working efficiently. – tzwickl Jan 26 '15 at 00:09
  • 1
    Agreed, it makes sense as an absolute metric (sort of; as well as actual bottlenecks, you're unlikely to reach theoretical max because of suboptimal vectorization). But your original question makes it sound like you're trying to use it as a comparative metric vs some other algorithm. – Oliver Charlesworth Jan 26 '15 at 00:12
  • @OliverCharlesworth Yeah, you're right, maybe I didn't get the point across exactly in the question ... But anyway, it would be nice to have some way of comparing these algorithms besides their runtime, and there I think FLOPS are the best way, because if you know that COOS multiplication has a throughput of 30%, CSR a throughput of 50%, and the normal matrix multiplication one of 60%, you can say that the normal matrix multiplication is more performant regarding the number of FPOs ... That's what we want to achieve ... – tzwickl Jan 26 '15 at 00:24
  • 1
    Hmm, that's a stand way to compare! Who cares if one version is better at utilising the CPU if another version is actually faster! – Oliver Charlesworth Jan 26 '15 at 00:25
  • Count the number of operations the _algorithm_ needs to perform, then count the number of seconds it takes to execute some code that performs those operations. Just because the FPU _might_ be able to hit 4 operations per cycle under perfect conditions (e.g. pipelined VMLA without dependency stalls), doesn't mean anything if your algorithm doesn't match those conditions. It may also be worthy of note that the NEON unit on Cortex-A8 is pretty rubbish compared to just about anything else. – Notlikethat Jan 26 '15 at 00:36
  • @OliverCharlesworth Yes, I get your point and I mostly agree with this statement, but I think the utilisation of the CPU comes into account when you use a different processor ... Consider a processor which can perform x times more FPOs than the one I use: if the program has a better FPO utilisation, I think the running time will also become much faster than on the one I use ... Then the running time won't scale proportionally with the CPU cycle time ... And this means that the running times are completely different and maybe the normal matrix multiplication becomes faster ... – tzwickl Jan 26 '15 at 00:36
  • @Notlikethat By operations the algorithm needs to perform, do you mean floating-point instructions like add, sub and mul? But how should I count them ... they completely depend on the matrices I multiply ... That's why I need some automated method of counting them ... – tzwickl Jan 26 '15 at 00:42
  • 1
    I haven't read all text here but if you can get away with a single 32-bit counter, you can reserve a register (gcc has such option) and use that register to keep track of FPOs. That way overhead will be slim and measurable. It might be slightly tricky to implement but might look cool with some nice macros. – auselen Jan 26 '15 at 09:25
  • @tom1991te, have you solved the problem by now? – yulian Apr 27 '15 at 03:12
  • @YulianKhlevnoy Yes, kind of. I did as suggested by Notlikethat and counted the number of instructions by placing counters in the program ... After that I removed the counters and measured the time it needed for the same input ... Then I looked up how many FLOPs each specific instruction performs and multiplied everything ... I got very reasonable results ;) – tzwickl Apr 27 '15 at 10:34
  • @tom1991te, that's good indeed. As there is no accepted answer, I thought to start a bounty on this question. If you still think there may be a better solution, I'll do it :) The question itself seems very interesting and may turn out useful for other users (and for me). – yulian Apr 27 '15 at 12:14
  • 1
    @YulianKhlevnoy Yes of course it would certainly be interesting to find out if there are other ways of solving this problem ;) The approach I used was sufficient for my application because it was a very small one but if you have a larger problem this approach has certainly the disadvantage that it is very cumbersome to calculate everything manually. – tzwickl Apr 27 '15 at 18:10
  • @tom1991te may be you can answer your own question with your solution. – auselen Apr 28 '15 at 08:32

2 Answers


It looks like the closest you can get with the performance events supported by Cortex-A8 is a count of total instructions executed, which isn't very helpful given that "an instruction" performs anything from 0 to (I think) 8 FP operations. Taking a step back, it becomes apparent that trying to measure FLOPS for the algorithm in hardware wouldn't really work anyway - e.g. you could write an implementation using vector ops but not always put real data in all lanes of each vector, then the CPU needs to be psychic to know how many of the FP operations it's executing actually count.


Fortunately, given a formal definition of an algorithm, calculating the number of operations involved should be fairly straightforward (although not necessarily easy, depending on the complexity). For instance, running through it in my head, the standard naïve multiplication of an m x n matrix with an n x m matrix comes out to m * m * (n + n - 1) operations (n multiplications and (n - 1) additions per output element). Once on-paper analysis has come up with an appropriately parameterised op-counting formula, you can plumb that into your benchmarking tool to calculate numbers for the data on test.
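That on-paper count can be plumbed into a benchmarking tool as a small helper, along these lines (a sketch in C; the function names are made up here):

```c
/* Analytic op count for the naive multiply of an (m x n)
   matrix by an (n x m) matrix: each of the m*m output
   elements needs n multiplications and (n - 1) additions,
   i.e. m * m * (2n - 1) operations in total. */
static long long matmul_ops(long long m, long long n) {
    return m * m * (2 * n - 1);
}

/* FLOPS from the analytic count and a measured run time. */
static double matmul_flops(long long m, long long n, double seconds) {
    return (double)matmul_ops(m, n) / seconds;
}
```

For sparse formats like COOS/CSR, the same idea applies, but the count is parameterised on the number of non-zeros rather than on m and n alone.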

Once you've done all that, you'll probably then start regretting spending all the time to do it, because what you'll have is (arbitrary number) / (execution time) which is little more meaningful than (execution time) alone, and mostly just complicates comparison between cases where (arbitrary number) differs. NEON performance in particular is dominated by pipeline latency and memory bandwidth, and as such the low-level implementation details could easily outweigh any inherent difference the algorithms might have.

Think of it this way: say on some given 100MHz CPU a + a + b + b takes 5 cycles total, while (a + b) * 2 takes 4 cycles total* - the former scores 60 MFLOPS, the latter only 50 MFLOPS. Are you going to say that more FLOPS means better performance, in which case the routine which takes 25% longer to give the same result is somehow "better"? Are you going to say fewer FLOPS means better performance, which is clearly untrue for any reasonable interpretation? Or are you going to conclude that FLOPS is pretty much meaningless for anything other than synthetic benchmarks to compare the theoretical maximum bandwidth of one CPU with another?

* numbers pulled out of thin air for the sake of argument; however they're actually not far off something like Cortex-M4F - a single-precision FPU where both add and multiply are single-cycle, plus one or two for register hazards.
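The arithmetic behind those two figures can be spelled out explicitly (with the same made-up cycle counts as above):

```c
/* MFLOPS for a routine, from its FP-op count, its cycle
   count, and the clock rate. */
static double mflops(double fp_ops, double cycles, double clock_hz) {
    return fp_ops * clock_hz / cycles / 1e6;
}
```

At 100 MHz, `a + a + b + b` is 3 adds in 5 cycles, giving 60 MFLOPS; `(a + b) * 2` is 1 add and 1 multiply in 4 cycles, giving 50 MFLOPS, despite finishing sooner.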

Notlikethat
  • Hmm, memory bandwidth and integer performance may run in parallel or in series for the different algorithms, so your last sentence is certainly correct. However, the ultimate goal is to perform the FLOPS. I think if you did have a FLOPS measurement it would be a worthwhile benchmark, but it wouldn't tell you how to improve things, so the performance counters are better for more use cases. Also, some ARM CPUs may have dual floating-point pipelines and others only one. – artless noise Apr 29 '15 at 13:41
  • The last example actually doesn't have much to do with the actual problem ... Of course you would take the second one, because everyone knows that multiplication costs more than addition, but in my problem I don't have a choice: I need to multiply those two matrices and there is no other way to do that. Your example is more about performance boosting by applying different equations to the same problem ... I believe that for mathematical computations FLOPs are more expressive than execution time. – tzwickl Apr 29 '15 at 21:35
  • 1
    Huh? Isn't the whole point that you have 3 _different_ ways to multiply two matrices, and you want to compare them, but using a measure which is demonstrably misleading? Since FLOPS is a well established "more is better" measurement, the graph apparently shows that both the COOS and CSR methods are orders of magnitude _slower_ than the normal form - expressive indeed! ;) – Notlikethat Apr 30 '15 at 01:42
  • Yes, but the measurements were done with a certain sparsity. The point I'm trying to make is that when the matrix reaches a certain sparsity, the normal matrix multiplication is faster. The point where the normal matrix multiplication gets faster than the other two methods depends on the FPU. Because the more FLOPs you can perform, the faster the normal matrix multiplication gets, but it doesn't influence the other two matrix multiplications that much because they also depend on the CPU. And how can you better represent this relationship than in a FLOPS graph ... – tzwickl Apr 30 '15 at 13:13
  • This is the fundamental flaw in the reasoning here: you **cannot** represent that relationship with FLOPS _at all_ in this way. FLOPS can only indicate "faster" if the number of operations is held constant. It can only indicate "more efficient" if the time is held constant. If _both_ quantities vary as they do here, their ratio alone cannot represent that information, and thus becomes useless as a comparison. This is simple to illustrate: for instance, I have 4 screens near me; can you tell me which is biggest if I tell you their aspect ratios are 1.6, 1.78, 1.6, and 1.25? – Notlikethat Apr 30 '15 at 19:09

Number of cores x average frequency x operations per cycle
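That formula gives only the theoretical peak, which real code rarely reaches. As a sketch (the example numbers are illustrative, not measured):

```c
/* Theoretical peak FLOPS = cores x clock (Hz) x FP ops per cycle.
   This is an upper bound; actual throughput is typically a
   fraction of it. */
static double peak_flops(int cores, double clock_hz, int ops_per_cycle) {
    return (double)cores * clock_hz * (double)ops_per_cycle;
}
```

For a hypothetical single-core 600 MHz part retiring 2 FP ops per cycle, `peak_flops(1, 600e6, 2)` gives 1.2 GFLOPS.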

Daniel Frausto