I am curious about the differences between the range-based for loop variants below (a quick sketch of each follows the list).
for(auto) vs for(auto &)
Separating the work into two for loops vs doing it all in one loop
for(auto &) vs for(const auto &)
for(int : list) vs for(auto : list) [list is integer vector]
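For reference, here is a minimal sketch of the variants I mean (the identifiers are just for illustration):

std::vector<int> list{1, 2, 3};
for (auto e : list) {}        // copies each element into e
for (auto& e : list) {}       // e is a mutable reference to the element
for (const auto& e : list) {} // e is a read-only reference to the element
for (int e : list) {}         // same as auto here, since the elements are int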
So I wrote the code below to test this with C++17.
There seems to be a difference in a CMake Debug build (without optimization):
// In debug mode
1. elapsed: 7639 (1663305922550 - 1663305914911)
2. elapsed: 3841 (1663305926391 - 1663305922550)
3. elapsed: 3810 (1663305930201 - 1663305926391)
But in a Release build (gcc -O3) there is no difference between tests 1-3:
// release mode
1. elapsed: 0 (1663305408984 - 1663305408984)
2. elapsed: 0 (1663305409984 - 1663305409984)
3. elapsed: 0 (1663305410984 - 1663305410984)
I don't know whether my test method is wrong, or whether it is expected that the difference disappears once optimization is enabled.
Here is my testing source code.
// create test vector
const uint64_t max_ = 499999999; // 499,999,999
std::vector<int> v;
for (int i = 1; i < max_; i++)
    v.push_back(i);
// test 1.
auto start1 = getTick();
for (auto& e : v)
{
    auto t = e + 100; t += 300;
}
for (auto& e : v)
{
    auto t = e + 200; t += 300;
}
auto end1 = getTick();
// test 2.
// Omit tick function
for (auto& e : v)
{
    auto t1 = e + 100; t1 += 300;
    auto t2 = e + 200; t2 += 300;
}
// test 3.
for (auto e : v)
{
    auto t1 = e + 100; t1 += 300;
    auto t2 = e + 200; t2 += 300;
}
...
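One thing I wondered: since t1 and t2 are never used, maybe the optimizer is allowed to remove the loops entirely in Release mode. A variant that actually uses the results would look something like this (the sum variable and the print are just for illustration):

// test 3 variant: accumulate and print the results so the work cannot be discarded
// (needs #include <iostream>)
uint64_t sum = 0;
for (auto e : v)
{
    auto t1 = e + 100; t1 += 300;
    auto t2 = e + 200; t2 += 300;
    sum += t1 + t2;
}
std::cout << "sum: " << sum << std::endl;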
getTick() itself returns the current time in milliseconds using std::chrono:
#include <chrono>

uint64_t getTick()
{
    using namespace std::chrono;
    return duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
}
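(I know std::chrono::steady_clock is usually recommended over system_clock for measuring elapsed time; a version like the one below should behave the same for this test. getTickSteady is just an illustrative name.)

// alternative tick function based on the monotonic steady_clock
uint64_t getTickSteady()
{
    using namespace std::chrono;
    return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}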
Also, this testing was done on a Debian-based aarch64 system:
- Jetson Xavier NX (JetPack 4.6, Ubuntu 18.04 LTS)
- 8 GB RAM
- GCC 7.5.0
Please advise if there is anything wrong. Thank you!