
I would like to measure how much time it takes to execute a piece of code. What would be the most effective and correct way to do this?

I wrote code that looks like the example below, and the results are not the same every time. There is a level of randomness in the timings, and I do not know why it happens or how to remove its effect.

#include<stdio.h>
#include<time.h>

int main() 
{
   int x=10;
   int y=25;
   int z=x+y;
   printf("Sum of x+y = %i", z);
   clock_t start = clock();
   for(int i=0;i<100000;i++)z=x+y;
   clock_t stop = clock();

   printf("\n\nArithmetic instructions take: %d",stop-start);
   start = clock();
   for(int i=0;i<100000;i++)z=x&y;
   stop = clock();

   printf("\n\nLogic instructions take: %d",stop-start);
}

The results look like this:

Arithmetic instructions take: 327
Logic instructions take: 360

Arithmetic instructions take: 271
Logic instructions take: 271

Arithmetic instructions take: 287
Logic instructions take: 294

Arithmetic instructions take: 279
Logic instructions take: 266

Arithmetic instructions take: 265
Logic instructions take: 296

What other ways are there to measure the time it takes to execute the loops?

NOTICE: The loops are NOT removed by compiler optimization; I checked.

So, what is the correct way to benchmark a piece of code?

  • https://stackoverflow.com/questions/2588310/measuring-cpu-clocks-consumed-by-a-process – dbrank0 Sep 27 '17 at 11:20
  • When benchmarking and measuring, do it *multiple times*. Perhaps even thousands or *millions* of times. Then list the max, min, average and mean time. You need to do it multiple times because on modern multi-tasking systems it evens out the difference in what the system is doing when you measure. – Some programmer dude Sep 27 '17 at 11:21
  • Basically, you have to use system-specific APIs for high-resolution timers. Then the result will also vary depending on the system. For example, it is impossible to benchmark code with exact results on a desktop OS, because it is not an RTOS and there are other programs running on the computer. You can get a pretty good idea of efficiency though, particularly when comparing two different snippets (see the sketch after these comments). – Lundin Sep 27 '17 at 11:23
  • @Someprogrammerdude "Millions". The last time I did this I had to loop hundreds of millions of times. See https://stackoverflow.com/questions/45380085/most-efficient-method-to-print-test/45381357#45381357 . – Steve Summit Sep 27 '17 at 11:53
  • The duplicate answer is far from perfect. @OP: on modern-architecture CPUs like x86 you basically can't even measure a single instruction meaningfully, as a single CPU core is capable of running several instructions at the same time, so without showing the code around the instruction it doesn't make much sense to time just one. An individual, well-separated instruction without dependencies may perform in a benchmark at the same speed as another one, yet in real code one of them can be stalling the CPU due to a clash with other instructions over CPU resources, while the other may run together with them. – Ped7g Sep 27 '17 at 11:58
  • Basically, on a modern x86 CPU almost everything is 1 clock in the ideal case (when measured separately), except for a few rare exceptions like division, etc. All the basic arithmetic and register manipulation is 1c, but when you write real code with them the total performance is completely different: the total time for N instructions may easily be around N/2 clocks on average, yet you will also get stalls of several tens to hundreds of cycles due to cache misses, mispredicted branches, register dependencies, etc. For good performance it's more important to design the data/code structure well than to use a particular instruction. – Ped7g Sep 27 '17 at 12:03
  • In short, benchmarking is BS. It is sooo easy to manipulate the results; even if you are dealing with a fixed sequence of machine code instructions you can still manipulate their performance (in some cases, not always; it depends on the system). High-level code is of course very easy to manipulate in ways that change the performance. I recommend Michael Abrash's *The Zen of Assembly Language*; the 8088 was obsolete when the book came out and of course is now, but if that is all you see you are missing the whole point: the foundation applies to all of these performance questions... – old_timer Sep 27 '17 at 12:08
  • The only correct answer here is not to use system timers, etc.; it is to watch the code execute in a simulation, to be able to watch each step of the pipe and manipulate the fetches, etc., so as not to stall the pipe or otherwise affect the code under test. Anything short of that is not a very good measurement. And again, yes, `and` is faster than `add`, absolutely, but the designers give enough room in the clock cycle to cover the longest pole in the tent, and add is a short pole. So to see the difference you would have to be able to see the actual analog behaviour of the logic, which you can't, but you can sim. – old_timer Sep 27 '17 at 12:11
  • You can try with PSpice or something like that. – old_timer Sep 27 '17 at 12:11
  • You are running on top of an operating system against DRAM; even if bare metal, you will see variation in time against DRAM. Get it out of DRAM and you still have the operating system and your surrounding code possibly affecting the code under test. At a minimum you need to put a loop of at least two around each test to get the test into the cache and off of DRAM. If the test is remotely accurate, the first run will be slower than the ones that follow, and they may vary, but by a smaller percentage. – old_timer Sep 27 '17 at 12:13
  • @Lundin: This is not a duplicate. Using high-rez timers won't make the OP's approach work in this case. If those loops *don't* optimize away, then optimization is probably disabled and the results are just measuring loop overhead. Short sequences of asm can't be fully characterized by just putting them in a loop and timing it, and the OP is doing that with C statements, not asm. Working on an answer. – Peter Cordes Sep 27 '17 at 12:34
  • Are you sure that your loops aren't optimized away? See [here](https://godbolt.org/g/GNY4PD) the differences between the generated code. Measuring unoptimized code doesn't seem really useful to me. – Bob__ Sep 27 '17 at 12:38
  • @Bob__ I believe that the loops are not optimized away, because if I add another zero to the number of loop iterations, executing the loops takes 10 times longer. In my opinion this means that the code in the loop executes as many times as the loop says, which means that the loops are not optimized away. Is this correct? –  Sep 27 '17 at 12:46
  • @old_timer Thank you for the information that you wrote here. Can you post an answer so I can accept it? I now understand that it is very hard to properly measure the performance of an instruction on modern architectures but your information is still interesting and relevant. Thank you. –  Sep 27 '17 at 12:47
  • Closed as a duplicate of your other question about this code. The right answer is more or less what [@Art says](https://stackoverflow.com/a/46447910/224132): this measurement technique won't work for this small a chunk of code. CPU performance doesn't work like that anymore, especially not when you're only looking indirectly through an optimizing compiler. See http://agner.org/optimize/ and other links in the [x86 tag wiki](https://stackoverflow.com/tags/x86/info) if you're interested in x86. – Peter Cordes Sep 27 '17 at 13:35
  • https://en.wikipedia.org/wiki/AND_gate and https://en.wikipedia.org/wiki/Adder_(electronics): assuming it takes the same amount of work to fetch the operands and store them and such, and simply comparing `and` vs. `add`, `and` is one gate, which you could expand into transistors; a full adder, though, is three levels. Without any other knowledge you can assume it takes three times as long for the adder to resolve a bit; this is analog/physics. In the digital domain you feed the inputs to the and gate or adder on a clock edge (change their input state), then you wait for them to settle and sample the result. – old_timer Sep 27 '17 at 14:07
  • But you design your system such that at least `add` and `and` have enough time to finish in a prescribed number of clocks, ideally one. Let's say it is one: you do your timing analysis (it is very doubtful that add or and are the long poles in the tent) and determine, from the things you designed to be latched in one clock cycle, which is the longest/slowest, and either change the design or call that the fastest you can go with this design. That period of time is going to be long enough for add or and to settle, so you won't be able to measure the difference this way with existing silicon. – old_timer Sep 27 '17 at 14:09

2 Answers


The numbers you get suggest that you did not compile with optimization flags enabled, which makes your benchmarking useless.

For example, I compiled with:

 gcc prog.c -Wall -Wextra -O2 -march=native

and, for `for(long long int i=0;i<10000000000000;i++)`, got timings of 0 and 1.

Which one gets the 1?

The first for loop, whichever one that is (I mean the one with the & or the + operator). The reason is that the first loop may find the program in a cold-start phase, for instance with regard to caching.

So, what is the correct way to benchmark a piece of code in general?

  • Compile with optimization flags (e.g. `-O3` in GCC).
  • Run multiple, independent experiments. Create two programs, where one will use the first algorithm, while the other will use the second, competing algorithm.
  • Increase the number of iterations, so that you can be sure that OS overheads are negligible.

For your particular case, the compilers on godbolt optimize away the loops completely. You didn't see it because you didn't compile with optimization flags enabled. Moreover, a test like yours is completely incapable of measuring anything useful about the C `+` operator, because it will compile to different instructions in different contexts on the same machine.
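
As a concrete illustration of the bullet points above, here is a minimal sketch of one of the two programs, assuming GCC on a desktop OS; bench_add.c is a hypothetical file name, and the volatile sink only keeps the loop from being deleted:

#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000000000LL    /* large enough to dwarf OS overheads */

int main(void)
{
    volatile int sink;             /* forces a store every iteration */
    int x = 10, y = 25;

    clock_t start = clock();
    for (long long i = 0; i < ITERATIONS; i++)
        sink = x + y;              /* the second program uses x & y here */
    clock_t stop = clock();

    printf("loop took %.3f s\n", (double)(stop - start) / CLOCKS_PER_SEC);
    return 0;
}

Build it with something like `gcc -O2 -o bench_add bench_add.c`, and build a second copy that uses `x & y` instead. Note, though, that with optimization enabled the compiler still folds `10 + 25` into a constant, so what this really measures is store/loop overhead, which is exactly why a micro-test like this cannot characterize `+` versus `&`.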

gsamaras
  • I tested gcc, clang, and MSVC on https://godbolt.org/g/FqTjHS. They all optimize away the loop completely (which isn't a surprise, since the inputs are compile-time constants!). So yeah, the loop is probably just bottlenecked on store-forwarding latency. – Peter Cordes Sep 27 '17 at 13:22
  • A test like this is completely incapable of measuring anything useful about the C `+` operator, because it will compile to different instructions in different contexts on the same machine. The first part of your answer is right, but the second part isn't useful for this question. The approach is fundamentally flawed. – Peter Cordes Sep 27 '17 at 13:23
  • @PeterCordes I injected your comments into my answer, how does it look now? :) – gsamaras Sep 27 '17 at 13:29
  • Your 2nd bullet point is still using this nonsense as an example. It would be fine if the code under test were large enough to dominate the loop overhead, but that's not the case here. – Peter Cordes Sep 27 '17 at 13:37
  • I agree @PeterCordes, that's why I said "in general", and then mentioned specifically what holds for this case. Isn't that clear enough? If not, please let me know. – gsamaras Sep 28 '17 at 18:13
  • I added https://stackoverflow.com/questions/2349776/how-can-i-benchmark-c-code-easily back into the list of duplicates to answer the general case. Make your bullet point more generic, like "one will use version `a`, while the other will use version `b`". As written, right now you're suggesting that `+` vs. `&` is an example of something you *can* benchmark this way. – Peter Cordes Sep 28 '17 at 18:40

The reason the timings vary is that running your program isn't the only thing the computer is doing. There are several things that might cause the program to take slightly more or less time:

  1. Other programs need to be given the CPU. Even if you aren't running other programs, the operating system will be. Any time it decides to switch contexts, there's a fair amount of work it needs to do, like backing up the registers to memory and restoring the other program's state. This is likely the dominant factor for your program.
  2. Cache misses: not likely to be a major issue here, since most of your memory accesses are going to be in one of two pages: the top of the stack and the region where the loader puts your constants. But those might not be in cache the first time they're loaded, and that would make a difference. In a more complicated program, this would likely be a more significant factor.
  3. Blocking system calls. Any function call that depends on a physical resource (network, file access, etc) may need to delay until that resource becomes available. Depending on what other programs are using that resource, the time this takes could vary.

The solution isn't to try to control for all these; that isn't practical. Just run the benchmark a few million times to average out all the noise. (My rule of thumb is to keep adding zeros to the count until it takes at least 20-30 seconds to finish, then run it twice and check that the results are roughly the same). That'll give you a good idea of how long it'll usually take, and if you also calculate the variance or standard deviation, you know how much that'll vary from case to case.
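
As a rough sketch of that approach (the names `run_benchmark` and `NRUNS` are placeholders, and the inner loop stands in for whatever code you actually want to time), you can repeat the whole measurement several times and report the spread:

#include <math.h>
#include <stdio.h>
#include <time.h>

#define NRUNS 20    /* repeat the whole benchmark this many times */

/* Placeholder benchmark; grow the iteration count until one run takes
   long enough (tens of seconds) that the noise becomes negligible. */
static double run_benchmark(void)
{
    volatile long long sink = 0;
    clock_t start = clock();
    for (long long i = 0; i < 200000000LL; i++)
        sink += i;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    double t[NRUNS], min = 1e30, max = 0.0, sum = 0.0;

    for (int r = 0; r < NRUNS; r++) {
        t[r] = run_benchmark();
        if (t[r] < min) min = t[r];
        if (t[r] > max) max = t[r];
        sum += t[r];
    }

    double mean = sum / NRUNS, var = 0.0;
    for (int r = 0; r < NRUNS; r++)
        var += (t[r] - mean) * (t[r] - mean);

    printf("min %.3f s  max %.3f s  mean %.3f s  stddev %.3f s\n",
           min, max, mean, sqrt(var / NRUNS));
    return 0;
}

Compile with optimization and link the math library (e.g. `gcc -O2 bench.c -lm`); if the standard deviation is small relative to the mean, the noise has mostly averaged out.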

Ray
  • Also turbo and power-saving CPU frequency changes, since the OP is using wall-clock time instead of performance counters to count core clock cycles. But more importantly, the OP is mostly just measuring loop overhead. – Peter Cordes Sep 27 '17 at 13:19
  • If you use `volatile counter`, you'll just be measuring store-forwarding latency. Also, using `clock()` inside the loop is totally wrong. `clock()` itself is *much* slower than the code under test here. If you want to measure `+` throughput or latency, you need an unrolled loop. Writing a microbenchmark like this in C just doesn't work if you really want to be measuring the x86 `add` or `lea` instructions vs. x86 `and` instructions. – Peter Cordes Sep 27 '17 at 13:40
  • This is the same problem as in https://stackoverflow.com/questions/46433208/which-is-faster-imm64-or-m64-for-x86-64, which is asking about `mov rax, imm64` vs. `mov rax, [memory]` for getting constants into register. At least the OP there wrote functions in asm, but [the test method still buried the results](https://stackoverflow.com/questions/46433208/which-is-faster-imm64-or-m64-for-x86-64#comment79853060_46433208). – Peter Cordes Sep 27 '17 at 13:42
  • @PeterCordes Good point; I've removed that part of the answer. I would have assumed that the volatile counter wouldn't make a difference since it was outside the sampled region, but you're almost certainly right about the overhead from `clock()` being more significant than the loop overhead. Not much the OP *can* do to keep the benchmark mechanism from dominating with such a small amount of actual code, though. – Ray Sep 27 '17 at 14:03
  • I replied to the part about `volatile` before I noticed you had `clock()` inside the loop. It's mostly true that keeping a loop counter in memory wouldn't affect a loop body that used `call clock` / `and edx, ebp` / `call clock` (plus much more math for the `clock` results)... x86's `rdtsc` instruction isn't serializing, though, so out-of-order execution can still overlap iterations. It's slow enough that a `volatile` loop counter probably wouldn't still be the bottleneck, though. – Peter Cordes Sep 27 '17 at 14:28