To give you an example of warm-up, I've recently benchmarked some NVIDIA CUDA kernel calls:
The execution speed increases over time, probably for several reasons, such as the GPU clock frequency being variable (to save power and keep the chip cool).
Sometimes a slow call also degrades the next call, so the benchmark can be misleading.
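For reference, here is a minimal sketch of the kind of measurement I mean (the kernel, sizes and launch configuration are placeholders, not the real benchmark): a few untimed launches serve as the warm-up, and only then is a launch timed with CUDA events.

```cpp
// Minimal warm-up sketch (CUDA C++). The kernel and sizes are hypothetical
// placeholders; only the structure matters.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const int block = 256;
    const int grid = (n + block - 1) / block;

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // A few untimed launches so the GPU reaches a stable clock state.
    for (int i = 0; i < 10; ++i)
        dummyKernel<<<grid, block>>>(d_data, n);
    cudaDeviceSynchronize();

    // Timed launch with CUDA events (launches are asynchronous, so a
    // host-side timer alone would mostly measure the launch overhead).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    dummyKernel<<<grid, block>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```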
If you want to guard against these effects, I advise you to:
- reserve all the dynamic memory (like vectors) first
- run the same work several times in a for loop before taking a measurement
- this implies initializing the input data (especially random data) only once before the loop and copying it each time inside the loop, so that every iteration does the same work (see the sketch after this list)
- if you deal with complex objects that cache internal state, I advise packing them into a struct and making an array of that struct (built with the same construction or cloning technique), to ensure that each loop iteration does the same work on the same starting data
- you can skip the for loop and the copying IF you alternate the two calls very often and assume that the behavioral differences will cancel each other out, for example in a simulation of continuous data such as positions
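As an illustration of the loop structure above, here is a minimal host-side C++ sketch; `doWork` is a hypothetical stand-in for whatever you are benchmarking:

```cpp
// Minimal sketch of the benchmark loop: allocate up front, initialize the
// random input once, copy it on every iteration.
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

double doWork(std::vector<float>& v)   // placeholder workload
{
    double acc = 0.0;
    for (float& x : v) { x = x * 1.0001f + 0.5f; acc += x; }
    return acc;
}

int main()
{
    const std::size_t n = 1 << 20;
    const int iterations = 100;

    // 1) Reserve/allocate all dynamic memory up front.
    std::vector<float> reference(n);
    std::vector<float> working(n);

    // 2) Initialize the (random) input data only once, before the loop.
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    for (float& x : reference) x = dist(rng);

    double sink = 0.0;   // see the EDIT below about defeating the optimizer
    for (int i = 0; i < iterations; ++i)
    {
        // 3) Copy the same starting data each iteration so every
        //    measurement does exactly the same work.
        std::copy(reference.begin(), reference.end(), working.begin());
        sink += doWork(working);
    }
    return sink > 0.0 ? 0 : 1;   // keep `sink` observable
}
```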
Concerning measurement tools, I have always run into problems with high_resolution_clock on different machines, such as inconsistent durations. In contrast, the Windows QueryPerformanceCounter works very well.
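For example, a minimal sketch of timing a section with QueryPerformanceCounter (Windows only; the timed loop is just a placeholder):

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   // ticks per second

    QueryPerformanceCounter(&t0);
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i) x += i * 0.5;   // placeholder work
    QueryPerformanceCounter(&t1);

    const double seconds =
        static_cast<double>(t1.QuadPart - t0.QuadPart) / freq.QuadPart;
    std::printf("elapsed: %.6f s\n", seconds);
    return 0;
}
```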
I hope that helps!
EDIT
I forgot to add that, as mentioned in the comments, the compiler's optimization behavior can indeed be annoying to deal with. The simplest way I have found is to accumulate into a variable some value that depends on non-trivial operations on both the warm-up data and the measured data, to force the computations to actually be performed, in sequence, as much as possible.
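Here is a minimal sketch of that idea; the workload is a placeholder, and the point is only that the `sink` variable depends on the computed results and is printed at the end:

```cpp
// A "sink" variable that depends on the results, so the compiler cannot
// remove the warm-up or the measured work. The workload is a placeholder.
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> data(1 << 16, 1.0f);
    std::uint64_t sink = 0;

    // Warm-up iterations also feed the sink.
    for (int i = 0; i < 10; ++i)
    {
        for (float& x : data) x = x * 1.0001f + 0.5f;
        sink += static_cast<std::uint64_t>(data[i % data.size()]);
    }

    // ... the timed iterations would feed `sink` the same way ...

    // Printing (or returning) the sink makes it observable, so the work
    // it depends on cannot be optimized away.
    std::printf("sink = %llu\n", static_cast<unsigned long long>(sink));
    return 0;
}
```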