Performance improves with more instructions in the loop body (unoptimized code)

Question

I am trying to research what is the most efficient way to do value assignment in C/C++. And have some funny findings.

Suppose I want to assign value to a variable.

Firstly, I directly assign the constant value to the variable.

auto start = high_resolution_clock::now(); 
for (int index = 0; index < 100000; ++index)
{
    a = 1;
}
auto stop = high_resolution_clock::now();

The corresponding assembly code is

Run in Visual Studio 2019 of my laptop, it takes about 190 microseconds.

Then I change the code to

int c = 1;
auto start = high_resolution_clock::now(); 
for (int index = 0; index < 100000; ++index)
{
    a = c;
}

The running time jumps to 240 microseconds, which is as expected because there is extra time to get value of c, see as below assembly code.

Now, I change the code to read value from a static array.

int b[] = { 1,2 };
auto start = high_resolution_clock::now(); 
for (int index = 0; index < 100000; ++index)
{
    a = b[0];
}

There are two more instructions to find address of array item.

But the funny thing is: the execution time drops to 230 microseconds, a little lower than the second case.

Can anybody help explain why the time of the third case is less than the second case?

Note: I run each case several times, and the time is average of several runs.

Benchmark results on unoptimised code are quite meaningless. Make an example with optimised code and you'll have much clearer figures. There is a well-known store-forwarding quirk on some microarchitectures that could explain the issues you observe. What microarchitecture are you running this on? — fuz, Jul 03 '20 at 18:42
In order to eliminate the effect of time slicing, take the _minimum_ time of all the runs. You may also wish to experiment with the number of loop iterations. Too small and the overhead of the timing functions enters into it. Too large and you'll definitely take an OS reschedule/timeslice. The overhead of incrementing/comparing `index` is significant vs. the [line of] code you're trying to measure. — Craig Estey, Jul 03 '20 at 18:43
@wyfeizj [Microarchitecture](https://en.wikipedia.org/wiki/Microarchitecture); there are a lot of implementations of x86-64. — Thomas Jager, Jul 03 '20 at 18:50
@ThomasJager I am not that familiar with Micro-architecture, but My laptop CPU is Intel(R) Core(TM) i7-6920HQ. — wyfeizj, Jul 03 '20 at 18:59
@CraigEstey I take the minimum time, but try different loop numbers, from 10*1000 to 1000*1000*1000, but in each case, the third method takes less time than the second one. — wyfeizj, Jul 03 '20 at 19:01
@TonyK I think I do not have any way to measure the execution time of a single instruction. I log the time before loop start, and then use current time subtracting the logged time to get the execution time( auto duration = duration_cast(stop - start); in my code. — wyfeizj, Jul 03 '20 at 19:08
As fuz mentioned, benchmarking _unoptimized_ code is meaningless. For example, if optimized, `a = c` would only fetch `c` _once_. And, the loop might get optimized away completely since it does the same thing on each iteration. You might need `volatile`. But, what are you trying to achieve? Are you trying to (e.g.) set `a` to the value of 1 and want to know if there is something faster than `a = 1` [literally]? But, the simplicity of `a = 1` wins. I wouldn't sweat something this simple [partly because it's difficult to measure accurately]. The optimizer _will_ figure this out for you. — Craig Estey, Jul 03 '20 at 19:36
Thanks, but the key is not how to assign 1 to a variable, I want to know why the third case takes less time than the second case. I want to know the things on the back. — wyfeizj, Jul 03 '20 at 19:51
@fuz: I think you're right about store-forwarding. The effect is on the loop *counter* which the OP didn't show, but is probably `add [rbp + i], 1`. The loop *body* is loading one location, storing a different one; many separate dependency chains for OoO exec. But the loop counter update will create a loop-carried dependency chain whose latency depends on store-forwarding. — Peter Cordes, Jul 03 '20 at 20:59

Performance improves with more instructions in the loop body (unoptimized code)

0 Answers0