9

What is the overhead in splitting a for-loop like this,

int i;

for (i = 0; i < exchanges; i++)
{
    // some code
    // some more code
    // even more code
}

into multiple for-loops like this?

int i;

for (i = 0; i < exchanges; i++)
{
    // some code
}

for (i = 0; i < exchanges; i++)
{
    // some more code
}

for (i = 0; i < exchanges; i++)
{
    // even more code
}

The code is performance-sensitive, but doing the latter would improve readability significantly. (In case it matters, there are no other loops, variable declarations, or function calls, save for a few accessors, within each loop.)

I'm not exactly a low-level programming guru, so it'd be even better if someone could measure up the performance hit in comparison to basic operations, e.g. "Each additional for-loop would cost the equivalent of two int allocations." But, I understand (and wouldn't be surprised) if it's not that simple.

Many thanks, in advance.

Andrew Cheong
  • 29,362
  • 15
  • 90
  • 145
  • that depends on the platform. If the cache is involved, then the degradation might be significant – BЈовић Dec 13 '12 at 19:29
  • also see http://www.agner.org/optimize/ – queen3 Dec 13 '12 at 19:30
  • 9
    The only answer to this is: profile it. – moswald Dec 13 '12 at 19:30
  • I suggest testing & measuring. – John Carter Dec 13 '12 at 19:30
  • In my experience, the second form can be faster, because the loop contents are simpler. Like, for example, it doesn't have to juggle registers so much inside the loop. – Mike Dunlavey Dec 13 '12 at 19:31
  • 1
    See also http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops/8547993#8547993 – Luchian Grigore Dec 13 '12 at 19:32
  • @moswald - Ah, I was hoping there'd be a straightforward answer, but I guess there isn't. I'll attempt to profile it, but as I have no experience in (rigorous) profiling, there's a greater chance I'll measure something wrong due to some misunderstanding, than I'll measure everything right. I'll report back if I come across any useful observations. – Andrew Cheong Dec 13 '12 at 19:43
  • @acheong87 Before I noticed Mystical's answer included test code, I posted my own profiling answer. (His answer goes into a lot of detail, and is much better.) – moswald Dec 13 '12 at 21:24

4 Answers4

14

There are often way too many factors at play... And it's easy to demonstrate both ways:

For example, splitting the following loop results in almost a 2x slow-down (full test code at the bottom):

for (int c = 0; c < size; c++){
    data[c] *= 10;
    data[c] += 7;
    data[c] &= 15;
}

And this is almost stating the obvious since you need to loop through 3 times instead of once and you make 3 passes over the entire array instead of 1.

On the other hand, if you take a look at this question: Why are elementwise additions much faster in separate loops than in a combined loop?

for(int j=0;j<n;j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}

The opposite is sometimes true due to memory alignment.


What to take from this?

Pretty much anything can happen. Neither way is always faster and it depends heavily on what's inside the loops.

And as such, determining whether such an optimization will increase performance is usually trial-and-error. With enough experience you can make fairly confident (educated) guesses. But in general, expect anything.

"Each additional for-loop would cost the equivalent of two int allocations."

You are correct that it's not that simple. In fact it's so complicated that the numbers don't mean much. A loop iteration may take X cycles in one context, but Y cycles in another due to a multitude of factors such as Out-of-order Execution and data dependencies.

Not only is the performance context-dependent, but it also vary with different processors.


Here's the test code:

#include <time.h>
#include <iostream>
using namespace std;

int main(){

    int size = 10000;
    int *data = new int[size];


    clock_t start = clock();

    for (int i = 0; i < 1000000; i++){
#ifdef TOGETHER
        for (int c = 0; c < size; c++){
            data[c] *= 10;
            data[c] += 7;
            data[c] &= 15;
        }
#else
        for (int c = 0; c < size; c++){
            data[c] *= 10;
        }
        for (int c = 0; c < size; c++){
            data[c] += 7;
        }
        for (int c = 0; c < size; c++){
            data[c] &= 15;
        }
#endif
    }

    clock_t end = clock();
    cout << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
}

Output (one loop): 4.08 seconds
Output (3 loops): 7.17 seconds

Community
  • 1
  • 1
Mysticial
  • 464,885
  • 45
  • 335
  • 332
  • 1
    this answer is wrong. I would like to post my answer but the question seems to be locked. I tried the two versions in godbolt and, as I expected, the resulting assembly code (with optimizations -O3) is exactly the same. – HAL9000 Dec 09 '19 at 15:55
  • @HAL9000 Note that this question is 7 years old. Compilers have certainly improved since then. – Mysticial Dec 09 '19 at 16:54
4

Processors prefer to have a higher ratio of data instructions to jump instructions.
Branch instructions may force your processor to clear the instruction pipeline and reload.

Based on the reloading of the instruction pipeline, the first method would be faster, but not significantly. You would add at least 2 new branch instructions by splitting.

A faster optimization is to unroll the loop. Unrolling the loop tries to improve the ratio of data instructions to branch instructions by performing more instructions inside the loop before branching to the top of the loop.

Another significant performance optimization is to organize the data so it fits into one of the processor's cache lines. So for example, you could split have inner loops that process a single cache of data and the outer loop would load new items into the cache.

This optimizations should only be applied after the program runs correctly and robustly and the environment demands more performance. The environment defined as observers (animation / movies), users (waiting for a response) or hardware (performing operations before a critical time event). Any other purpose is a waste of your time, as the OS (running concurrent programs) and storage access will contribute more to your program's performance issues.

Thomas Matthews
  • 56,849
  • 17
  • 98
  • 154
  • 1. Decrease the number of jump instructions. Unroll, have the least possible number of loops. 2. Cache align your data, if not align cpu instruction take more cycles. 3. Use prefetch instructions if available to bring the data to L1 cache, so there is less cache miss. – Raunak Mukhia Oct 07 '14 at 15:25
2

This will give you a good indication of whether or not one version is faster than another.

#include <array>
#include <chrono>
#include <iostream>
#include <numeric>
#include <string>

const int iterations = 100;

namespace
{
    const int exchanges = 200;

    template<typename TTest>
    void Test(const std::string &name, TTest &&test)
    {
        typedef std::chrono::high_resolution_clock Clock;
        typedef std::chrono::duration<float, std::milli> ms;

        std::array<float, iterations> timings;

        for (auto i = 0; i != iterations; ++i)
        {
            auto t0 = Clock::now();

            test();

            timings[i] = ms(Clock::now() - t0).count();
        }

        auto avg = std::accumulate(timings.begin(), timings.end(), 0) / iterations;
        std::cout << "Average time, " << name << ": " << avg << std::endl;
    }
}

int main()
{
    Test("single loop",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
                // some more code
                // even more code
            }
        });

    Test("separated loops",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
            }

            for (auto i = 0; i < exchanges; ++i)
            {
                // some more code
            }

            for (auto i = 0; i < exchanges; ++i)
            {
                // even more code
            }
        });
}
moswald
  • 11,491
  • 7
  • 52
  • 78
-3

The thing is quite simple. The first code is like taking a single lap on a race track and the other code is like taking a full 3-lap race. So, more time required to take three laps rather than one lap. However, if the loops are doing something that needs to be done in sequence and they depend on each other then second code will do the stuff. for example if first loop is doing some calculations and second loop is doing some work with those calculations then both loops need to be done in sequence otherwise not...