2

I'm in VC++ 2013, Windows 7 64-bit, Intel i7 3.6 GHz. I want to measure the execution time of very fast math operations: for example, I wish to compare the performance of the standard fabsf() function with alternate "faster" methods, or the standard tanh() vs. the Padé approximation, etc.

The problem is that these operations are soooo fast that even though I run them a gazillion times, I always get 0 milliseconds between the end and the start of the benchmark.

I tried to get the time in nanoseconds using <chrono> but it is rounded to a tenth of a millisecond, not really a nanosecond, so I still get 0 elapsed nanoseconds in my benchmark.

Can you please provide a snippet of code that I can use to run my benchmarks?

This is mine:

#include <vector>
#include <chrono>
#include <ctime>
#include <cstdio>    // printf
#include <cstdlib>   // srand, rand
#include <tchar.h>   // _tmain, _TCHAR (MSVC)
using namespace std;

// 1/RAND_MAX
#define RAND_MAX_RECIP      0.00003051757f

int _tmain(int argc, _TCHAR* argv[])
{
    srand (static_cast <unsigned> (time(0)));

    // Fill a buffer with random float numbers
    vector<float> buffer;
    for (unsigned long i=0; i<10000000; ++i)
        buffer.push_back( (float)rand() * RAND_MAX_RECIP );

    // Get start time
    auto start = std::chrono::high_resolution_clock::now();

    for (unsigned long i=0; i<buffer.size(); ++i)
    {
        // do something with the float numbers in the buffer
    }

    // Get elapsed time
    auto finish = std::chrono::high_resolution_clock::now();

    printf("Executed in %d ns\n\n", std::chrono::duration_cast<std::chrono::nanoseconds>(finish-start).count());

    return 0;
}
Mark Miles
  • Can you make sure that the operations you perform are not optimized away by the compiler resulting in zero time elapsed in any case because no work is actually done? – Jens Aug 02 '14 at 13:40
  • I don't think Windows has timers finer than a tenth of a millisecond. Use Linux instead or use the platform-specific `QueryPerformanceCounter` function. – tmyklebu Aug 02 '14 at 13:41
  • You run the very fast operation many times, and get an average. – Some programmer dude Aug 02 '14 at 13:42
  • First like Jens said, you should verify your compiler optimization flags. Second, unless you use very specific OSes the counter accuracy will not reach nanosecs. Third, at this level I expect the overhead caused by the "for" loop to bias your results. At this point I would jump in ASM... – TocToc Aug 02 '14 at 13:46
  • btw in MSVC 2013 `high_resolution_clock` is garbage: it's actually a typedef for `system_clock`. This will be fixed in MSVC 2014 and it will use `QueryPerformanceCounter`. – Nazar554 Aug 02 '14 at 13:49
  • I'm compiling this code with /O2 because I need to know effectively how fast my actual application will run when I compile the release build. On a side note, the release build will also be compiled with SSE2 instructions on, but I have disabled them for this benchmark. – Mark Miles Aug 02 '14 at 13:52
  • You can use [rdtsc intrinsic](http://msdn.microsoft.com/en-us/library/twchhe95.aspx), which reads the cpu cycle count. – rcgldr Aug 02 '14 at 16:09

5 Answers

6

I think the most likely problem is that the compiler notices that you aren't using the results of your calculations and optimizes the calculation away. You simply need to convince the compiler to not do that.

I would recommend simply keeping a running sum of all calculation results, and printing it out after you've printed the time the loop takes. You'll ignore the final sum, but the compiler won't know that.
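For example, a minimal sketch of that idea (the buffer contents and the fabsf() call are just stand-ins for whatever operation you are actually testing):

#include <chrono>
#include <cstdio>
#include <math.h>
#include <vector>

int main()
{
    std::vector<float> buffer(10000000, 0.5f);

    auto start = std::chrono::high_resolution_clock::now();

    float sum = 0.0f;                    // running sum keeps the work "observable"
    for (std::size_t i = 0; i < buffer.size(); ++i)
        sum += fabsf(buffer[i]);         // operation under test

    auto finish = std::chrono::high_resolution_clock::now();

    // Printing sum forces the compiler to actually perform the loop.
    printf("Elapsed: %lld ns (checksum %f)\n",
           (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count(),
           sum);
    return 0;
}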

Max Lybbert
5

To prevent the problem that Jens alluded to, you must make use of the result. To solve the problem of "no matter how high I set the iteration count, the time is always 0", take an alternate approach: run the operation for 1 second and count how many times it was processed.

Pseudo code is

   double TestFunc()
   {
        double dSum = 0, dForce = 0;
        while (run)                // 'run' is a flag you clear after ~1 second
        {
             // do the test and keep the result
             // dForce += fabs(v); // whatever v is - just keep the result
             dSum += 1;
        }
        printf("anchor answer is %f\n", dForce); // this forces the compiler to generate code
        return dSum;
   }

Then run that code for 1 second, or however long.

The trick then is to run the same loop, for the same length of time, without the test code and see how many times it iterates. You then convert each count to a per-iteration time and subtract the empty-loop figure from the full-loop figure to see how long your code (alone) took.
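A bare-bones sketch of that approach (the deadline check via std::chrono inside the loop is just one way to implement "run for 1 second", and fabsf() is a placeholder for the operation under test):

#include <chrono>
#include <cstdio>
#include <math.h>

// Runs the loop for roughly one second and returns the iteration count.
// Pass doWork = false to measure the loop overhead alone.
long long CountIterations(bool doWork)
{
    volatile float v = 0.5f;          // volatile so the load isn't optimized away
    float anchor = 0.0f;
    long long count = 0;

    auto stop = std::chrono::steady_clock::now() + std::chrono::seconds(1);
    while (std::chrono::steady_clock::now() < stop)
    {
        if (doWork)
            anchor += fabsf(v);       // operation under test
        ++count;
    }
    printf("anchor = %f\n", anchor);  // keep the result observable
    return count;
}

int main()
{
    long long withWork    = CountIterations(true);
    long long withoutWork = CountIterations(false);
    printf("iterations with test: %lld, without: %lld\n", withWork, withoutWork);
    return 0;
}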

SingleStepper
4

Functions like fabs(), which map directly to instructions, are difficult to evaluate in synthetic benchmarks because their execution time is so low compared to pipeline latency, memory access time, etc. For example, if you have a loop which reads a float from an array, finds its absolute value, then writes the value back to the array, it's possible that doing a second fabs() in the loop wouldn't make a difference to the execution time -- the algorithm would be memory-bound, not CPU-bound.

By the same token, it's difficult to measure the "speed" of an operation like fabs with a single number. Particularly with certain multiple-issue and out-of-order processors, the time taken to perform such an operation is heavily dependent on what other operations are being performed before and after it.

You should take a look at Agner Fog's pages on x86/x64 instruction timings to get an idea of the nuances involved. As for practicalities, don't bother trying to time a single operation. Try to time an algorithm that you actually want to use the operation in. If there's a difference, you know which one to use, and you know that that choice is correctly contextualized for your particular use case. If there's no significant difference (and I'm guessing there won't be), then you know it doesn't matter.
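For instance, instead of timing fabsf() on its own, you might time a whole pass over a buffer with each candidate implementation (a sketch; abs_bitmask() here is only a hypothetical example of an "alternative" abs, not a recommendation, and it assumes 32-bit floats):

#include <chrono>
#include <cstdio>
#include <cstring>
#include <math.h>
#include <vector>

// Hypothetical alternative absolute value: clear the sign bit directly.
static inline float abs_bitmask(float x)
{
    unsigned int u;                              // assumes sizeof(float) == sizeof(unsigned int) == 4
    std::memcpy(&u, &x, sizeof u);
    u &= 0x7FFFFFFFu;
    std::memcpy(&x, &u, sizeof u);
    return x;
}

// Time a whole pass over the buffer with a given abs implementation.
template <typename AbsFn>
double TimePassMs(const std::vector<float>& in, AbsFn absFn)
{
    auto start = std::chrono::high_resolution_clock::now();
    float sum = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i)
        sum += absFn(in[i]);                     // realistic use: abs feeding an accumulation
    auto finish = std::chrono::high_resolution_clock::now();
    printf("checksum %f\n", sum);                // keep the work observable
    return std::chrono::duration<double, std::milli>(finish - start).count();
}

int main()
{
    std::vector<float> buffer(10000000, -0.25f);
    printf("fabsf:       %.3f ms\n", TimePassMs(buffer, [](float v) { return fabsf(v); }));
    printf("bitmask abs: %.3f ms\n", TimePassMs(buffer, [](float v) { return abs_bitmask(v); }));
    return 0;
}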

Sneftel
3

You can use the rdtsc instruction to time at the clock cycle level.

#include <stdint.h>   // uint64_t
#include <intrin.h>   // __rdtsc, _mm_lfence (MSVC)

uint64_t begin = __rdtsc();
_mm_lfence();
// insert your code here
_mm_lfence();
uint64_t end = __rdtsc();
uint64_t clocks = end - begin;

The fences are there to avoid reordering the instructions.

Time a couple hundred thousand times and take the median value. The following pitfalls apply:

  1. You're probably using an Intel CPU with turbo boost. Disable that. rdtsc always ticks according to the processor's base clock. I usually use throttlestop for this.
  2. Because you're writing in C++, in general you have no control over what the compiler generates. Nothing prevents the compiler from generating a cmov (conditional move) instruction that reads from memory instead of a register.
  3. The speed of an instruction sequence can be measured in multiple ways. For example, an SSE multiply instruction takes 5 clock cycles before the result is available ('latency'), but the CPU can issue one multiply per clock cycle. Other instructions can be issued multiple times per clock cycle, or take more than one clock cycle to issue.
  4. There is also the issue of instructions that take a variable amount of time, such as DIV, a branch, or anything that reads from memory.

You might want to use http://agner.org/optimize/#testp for running benchmarks at the instruction level.
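A rough sketch of the "take the median of many samples" idea (assumes MSVC's <intrin.h>; the single fabsf() call is only a placeholder, and for something that short the rdtsc/lfence overhead itself dominates, so in practice you would put a longer sequence between the two reads):

#include <algorithm>
#include <cstdio>
#include <intrin.h>   // __rdtsc, _mm_lfence
#include <math.h>
#include <stdint.h>
#include <vector>

int main()
{
    const int kSamples = 200000;
    std::vector<uint64_t> samples(kSamples);
    volatile float x = -1.5f, result = 0.0f;   // volatile keeps the work from being optimized out

    for (int i = 0; i < kSamples; ++i)
    {
        _mm_lfence();
        uint64_t begin = __rdtsc();
        result = fabsf(x);                      // code under test (placeholder)
        _mm_lfence();
        uint64_t end = __rdtsc();
        samples[i] = end - begin;
    }

    std::nth_element(samples.begin(), samples.begin() + kSamples / 2, samples.end());
    printf("median: %llu cycles (result %f)\n",
           (unsigned long long)samples[kSamples / 2], (double)result);
    return 0;
}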

Stefan
  • A simple overclocking tool, but also useful for fixing your CPU at the clock speed you want. Download it here: http://www.techpowerup.com/downloads/2288/throttlestop-6-00/ – Stefan Aug 05 '14 at 07:39
  • The TSC frequency isn't always equal to the max non-turbo frequency. e.g. my i7-6700K has a non-turbo frequency of 4.0 GHz, but a TSC of 4008 MHz. Some newer CPUs can differ by more, like i5-1035 Ice Lake has TSC = 1.5 GHz, base = 1.1 GHz. IIRC that data in my answer on [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/q/13772567) was from BeeOnRope. – Peter Cordes Jun 01 '23 at 23:38
  • Ideally use `rdpmc` to read a perf counter that's counting the `cpu_clk_unhalted.thread` event, counting true core clock cycles for this logical core. (e.g. run under `perf stat` as an easy way to get counters programmed, or one of the light-weight kernel modules designed to allow programming counters for collection by user-space. I forget what `perf` / kernel setting is required to allow user-space rdpmc to not fault.) – Peter Cordes Jun 01 '23 at 23:42
1

The general strategy for benchmarks of this kind is:

  1. Estimate the time you expect (somehow, maybe by experimentation).
  2. Write code to execute the sequence under test multiple times so that the result lies in a suitable range for your timing tools (say between 1 and 100 seconds).
  3. Choose an optimisation level, depending on what you're measuring.
  4. Examine the generated code to ensure it does what you expect.
  5. Do multiple runs: one to fill up any caches and then at least 2 repetitions to make sure you get the same answers.
  6. Keep careful records of loop counts and times.
  7. Test by 2 or 3 different strategies and make sure you get consistent results across all tests.

You will find that compilers can be very sneaky in skipping loops that do no useful work. Disable optimisation and/or make the code more complicated until the compiler generates the sequence you want.

Watch carefully for pipeline and caching effects. Unless or until you can get exactly matching answers on multiple repetitions by multiple strategies you cannot rely on the results. This is experimental computer science, and it's hard.
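A bare-bones harness along the lines of steps 2, 5 and 6 might look like this (a sketch; the iteration count and the fabsf() call are placeholders you would tune and replace by experimentation):

#include <chrono>
#include <cstdio>
#include <math.h>

int main()
{
    const long long kIterations = 200000000;    // tune so one run lands in your timer's sweet spot (step 2)
    const int kRuns = 3;                        // one warm-up run plus at least 2 repetitions (step 5)

    for (int run = 0; run < kRuns; ++run)
    {
        volatile float x = -0.75f;              // volatile so the loop isn't folded away
        float sum = 0.0f;
        auto start = std::chrono::steady_clock::now();
        for (long long i = 0; i < kIterations; ++i)
            sum += fabsf(x);                    // sequence under test (placeholder)
        auto finish = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(finish - start).count();

        // Record loop counts and times for every run (step 6).
        printf("run %d: %lld iterations in %.3f s, %.3f ns/iteration (checksum %f)\n",
               run, kIterations, seconds, seconds * 1e9 / kIterations, sum);
    }
    return 0;
}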

david.pfx