14

I have a friendly competition with a couple of guys in the field of programming, and recently we have become very interested in writing efficient code. Our challenge is to try to optimize the code (in the sense of CPU time and complexity) at any cost (readability, reusability, etc.).

The problem is that we now need to compare our code and see which approach is better compared to the others, but we don't know of any tools for this purpose.

My question is: are there some (any!) tools that take a piece of code as input and calculate the number of FLOPs or CPU instructions necessary to run it? Is there any tool that can measure the optimacy of a code?

P.S. The target language is C++, but it would be nice to know whether such tools also exist for Java.

Pouya
  • 1,266
  • 3
  • 18
  • 44
  • 3
    +1 for the word "optimacy". Is it enough to run `time ./prog`? – Kerrek SB Sep 09 '12 at 16:38
  • 1
    @KerrekSB I believe OP wants a profiler. –  Sep 09 '12 at 16:42
  • 3
    I don't think counting flops or CPU instructions is a good measure of efficiency. [It's easy to slap together artificial do-nothing code that can max that out.](http://stackoverflow.com/questions/8389648/how-to-achieve-4-flops-per-cycle) – Mysticial Sep 09 '12 at 16:44

7 Answers

12

Here's a little C++11 stopwatch I like to roll out when I need to time something:

#include <chrono>
#include <ctime>

// Generic stopwatch: T must provide time_point and duration types and a
// static now() function, as the std::chrono clocks do.
template <typename T> class basic_stopwatch
{
    typedef T clock;
    typename clock::time_point p;
    typename clock::duration   d;

public:
    void tick()  { p  = clock::now();            }
    void tock()  { d += clock::now() - p;        }
    void reset() { d  = clock::duration::zero(); }

    template <typename S> unsigned long long int report() const
    {
        return std::chrono::duration_cast<S>(d).count();
    }

    unsigned long long int report_ms() const
    {
        return report<std::chrono::milliseconds>();
    }

    basic_stopwatch() : p(), d() { }
};

// Adapter so that std::clock() can serve as the clock parameter T.
struct c_clock
{
    typedef std::clock_t time_point;
    typedef std::clock_t duration;
    static time_point now() { return std::clock(); }
};

// std::clock_t is an arithmetic type, so convert it via CLOCKS_PER_SEC.
template <> unsigned long long int basic_stopwatch<c_clock>::report_ms() const
{
  return 1000. * double(d) / double(CLOCKS_PER_SEC);
}

typedef basic_stopwatch<std::chrono::high_resolution_clock> stopwatch;
typedef basic_stopwatch<c_clock> cstopwatch;

Usage:

stopwatch sw;
sw.tick();

run_long_code();

sw.tock();
std::cout << "This took " << sw.report_ms() << "ms.\n";

On any decent implementation, the default high_resolution_clock should give very accurate timing information.

Kerrek SB
  • 464,522
  • 92
  • 875
  • 1,084
  • On Visual Studio, `high_resolution_clock` is horrible; use the Boost library's version in that case. – ronag Sep 09 '12 at 17:06
  • +1: Timing is usually better than profiling when it comes to determining which piece of code is faster. Profiling usually misses caching effects and the like, which can have a huge effect on performance. – Leo Sep 09 '12 at 19:22
  • For high-resolution timing, you should directly use the appropriate system calls, such as clock_gettime() on Linux, which provides actual nanosecond resolution. The chrono call overhead might cause too much jitter or other negative effects. The clock_gettime() system call takes several hundred CPU ticks to run, but its duration seems quite constant. If you want _perfect_ timing, disable your operating system's CPU frequency governor and count actual CPU ticks with the processor's own instruction (such as rdtsc). – mic_e Sep 09 '12 at 20:53
  • 1
    @mic_e: The problem with TSC is that it doesn't play well with multiple cores. The steady clock provided by `chrono`, or `clock_gettime`, overcomes this. – Kerrek SB Sep 09 '12 at 21:19
  • Why is this so complicated? How is this any better than just using `std::chrono::high_resolution_clock::now();` before and after the function is called and subtracting the difference? – northerner Jan 23 '19 at 18:45
  • @northerner: If you just need to do this once, ever, then sure, but a small library solution like this makes repeated use of this a fair bit less cluttered. – Kerrek SB Jan 23 '19 at 22:46
  • Good! But why does report_ms return `unsigned long long int` instead of double? BTW, I hadn't heard of such a type as `unsigned long long int`. – Lewis Chan Aug 25 '19 at 12:17
  • @LewisChan: I think that's because the primary template returns a count that is typically integral, and so the specialization does the same. – Kerrek SB Aug 29 '19 at 21:04
3

There is the std::clock() function from <ctime>, which returns how much CPU time was spent on the current process (meaning it doesn't count the time the program was idle because the CPU was executing other tasks). This function can be used to accurately measure the execution times of algorithms. Use the constant CLOCKS_PER_SEC (a macro, also from <ctime>) to convert the return value into seconds.
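
For example, a minimal sketch of such a measurement; run_long_code here is just a made-up stand-in for whatever code you actually want to time:

#include <cstdio>
#include <ctime>

// Stand-in workload; replace with the code under test.
void run_long_code()
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i) x = x + 1.0;
}

int main()
{
    std::clock_t start = std::clock();  // CPU time used so far
    run_long_code();
    std::clock_t end = std::clock();

    // clock() counts in implementation-defined ticks;
    // CLOCKS_PER_SEC converts those ticks to seconds.
    std::printf("CPU time: %.3f s\n",
                double(end - start) / CLOCKS_PER_SEC);
}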

Philipp
  • 67,764
  • 9
  • 118
  • 153
1

From inline assembly, you can use the rdtsc instruction to get the least significant 32 bits of the cycle counter into eax and the most significant 32 bits into edx. If your code is small enough, you can check the total approximate CPU cycles with just the eax register. If the count exceeds the maximum 32-bit value, edx increments once per 32-bit wraparound.

#include <cstdio>    // getchar()
#include <ctime>     // clock()
#include <iostream>  // std::cin, std::cout

// Note: the __asm blocks use MSVC's 32-bit inline-assembly syntax.
int main()
{
    int cpu_clk1a=0;
    int cpu_clk1b=0;
    int cpu_clk2a=0;
    int cpu_clk2b=0;
    int max=0;
    std::cin>>max; //loop limit

    __asm
    {
        push eax
        push edx
        rdtsc    //gets current cpu-clock-counter into eax&edx
        mov [cpu_clk1a],eax
        mov [cpu_clk1b],edx
        pop edx
        pop eax
    }

    long temp=0;
    for(int i=0;i<max;i++)
    {
        temp+=clock(); //needed to defeat optimization so we actually
                       //measure something: even the smartest compiler
                       //cannot know what clock() will return
    }

    __asm
    {
        push eax
        push edx
        rdtsc    //gets current cpu-clock-counter into eax&edx
        mov [cpu_clk2a],eax
        mov [cpu_clk2b],edx
        pop edx
        pop eax
    }

    std::cout<<(cpu_clk2a-cpu_clk1a)<<std::endl;
    //if your loop takes more than ~2 billion cpu-clocks,
    //use cpu_clk1b and cpu_clk2b as well
    getchar();
    getchar();
    return 0;
}

Output: 74000 CPU cycles for 1000 iterations and 800000 CPU cycles for 10000 iterations on my machine, because clock() itself is time-consuming.

The CPU-cycle resolution on my machine is roughly 1000 cycles, so yes, you need more than several thousand additions/subtractions (fast instructions) to measure anything reasonably accurately.

Assuming the CPU's working frequency is constant, 1000 CPU cycles equal roughly 1 microsecond on a 1 GHz CPU. You should warm your CPU up before doing this.
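
As a side note, here is a sketch of the same idea using the __rdtsc() intrinsic instead of inline assembly, assuming a compiler that provides it (MSVC via <intrin.h>, GCC/Clang via <x86intrin.h>); the intrinsic returns the full 64-bit counter, so the two 32-bit halves don't need separate handling:

#include <cstdint>
#include <iostream>

#ifdef _MSC_VER
#include <intrin.h>      // __rdtsc() on MSVC
#else
#include <x86intrin.h>   // __rdtsc() on GCC/Clang
#endif

int main()
{
    std::uint64_t start = __rdtsc();  // full 64-bit timestamp counter

    volatile long temp = 0;           // volatile keeps the loop from being removed
    for (int i = 0; i < 10000; ++i)
        temp = temp + i;

    std::uint64_t end = __rdtsc();
    std::cout << (end - start) << " cpu-cycles\n";
}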

huseyin tugrul buyukisik
  • 11,469
  • 4
  • 45
  • 97
0

It is quite difficult to calculate a detailed CPU-time figure from a block of code alone. The normal way to do this is to design worst-case / average / best-case input data as test cases, and to do timing profiling of your real code against those test cases. No tool can tell you the FLOP count without the detailed input test data and conditions.
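
For instance, a minimal harness along those lines; code_under_test and the three datasets are made-up placeholders, with std::sort standing in for the algorithm being compared:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Placeholder for the algorithm being compared.
void code_under_test(std::vector<int>& v) { std::sort(v.begin(), v.end()); }

int main()
{
    const int n = 1000000;
    std::vector<int> best(n), worst(n), average(n);
    for (int i = 0; i < n; ++i) {
        best[i]    = i;                      // already sorted: best case
        worst[i]   = n - i;                  // reverse sorted: worst-ish case
        average[i] = (i * 2654435761u) % n;  // scrambled: average case
    }

    for (auto* data : { &best, &average, &worst }) {
        auto t0 = std::chrono::steady_clock::now();
        code_under_test(*data);
        auto t1 = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
        std::printf("%lld us\n", (long long)us.count());
    }
}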

wbao
  • 205
  • 1
  • 5
0

There are pieces of software called profilers which do exactly what you want.

Examples are AMD CodeAnalyst for Windows and gprof for POSIX.

  • 1
    The problem with profiling to judge performance competitions is that the act of profiling itself slows down the program execution. Some algorithms might be affected more than others, which skews the results of the competition. Also, compiling with profiling support prevents some compiler optimizations which could be used to squeeze out the last bit of extra speed. Don't get me wrong - profilers are important tools to find performance bottlenecks. But you shouldn't trust them blindly. – Philipp Sep 10 '12 at 08:11
  • @Philipp I don't trust profilers blindly; I'm not that maths-freak type of guy. I thought it would be a good idea to mention them, as it seemed to me that the OP didn't know about them at all. –  Sep 10 '12 at 10:50
  • @Philipp Nor do I see why this answer should be downvoted; it is technically correct... –  Sep 10 '12 at 12:51
  • Again, if you downvote, please leave a comment with your **precise reasoning!** –  Sep 10 '12 at 20:03
0

The best fit for your purposes is Valgrind's callgrind tool.
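
Callgrind counts the instructions your program actually executes, which is about as close as a tool gets to the instruction count you asked about. Typical usage would be `valgrind --tool=callgrind ./your_program` (with `your_program` standing in for the binary under test), which writes a `callgrind.out.<pid>` file you can then inspect with `callgrind_annotate` or KCachegrind.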

PiotrNycz
  • 23,099
  • 7
  • 66
  • 112
0

Measuring the number of CPU instructions is pretty useless.

Performance is relative to a bottleneck; depending on the problem at hand, the bottleneck might be the network, disk I/O, memory, or the CPU.

For just a friendly competition, I would suggest timing, which of course implies providing test cases that are big enough to yield meaningful measurements.

On Unix, you can use gettimeofday for relatively precise measurements.
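
A minimal sketch of that approach (the work() function here is a made-up stand-in for the code being timed):

#include <cstdio>
#include <sys/time.h>   // gettimeofday (POSIX)

// Stand-in workload; replace with the code under test.
void work()
{
    volatile double x = 0.0;
    for (long i = 0; i < 10000000L; ++i) x = x + 1.0;
}

int main()
{
    timeval start, end;
    gettimeofday(&start, nullptr);  // wall-clock time, microsecond resolution
    work();
    gettimeofday(&end, nullptr);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1e6;
    std::printf("elapsed: %.6f s\n", elapsed);
}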

Matthieu M.
  • 287,565
  • 48
  • 449
  • 722