
I always believed that foo2 below is faster than foo3, until I ran a test.

The full code is below:

#include <iostream>
#include <boost/timer.hpp>
#include <boost/lexical_cast.hpp>
#include <stdint.h>

struct session {
    bool operator==(const session& r) const;

    uint8_t proto;
    uint16_t sport;
    uint16_t dport;
    uint32_t sip;
    uint32_t dip;
};

bool session::operator==(const session& r) const {
    return proto == r.proto && sport == r.sport && dport == r.dport 
        && sip == r.sip && dip == r.dip;
}

// My total L1+L2+L3 cache size is about 16MB, so use a 32MB array to overflow all of the caches.
static const int SIZE = 32 * 1024 * 1024 / sizeof(session);
int sum;

void foo1(session* p) {
    session s = {1, 2, 3, 4, 5};
    for (int i = 0; i < SIZE; i++)
        if (p[i] == s)
            sum++;
}

void foo2(session* p) {
    session s = {1, 2, 3, 4, 5};
    int n = SIZE - SIZE % 4;
    int i;

    for (i = 0; i < n; i += 4) {
        if (p[i + 0] == s)
            sum++;
        if (p[i + 1] == s)
            sum++;
        if (p[i + 2] == s)
            sum++;
        if (p[i + 3] == s)
            sum++;
    }
    /*
    for (; i < SIZE; i++)
            if (p[i] == s)
               sum++;
    */
}

void foo3(session* p) {
    session s = {1, 2, 3, 4, 5};
    int n = SIZE - SIZE % 4;
    int i;

    for (i = 0; i < n; i += 4) {
        if (p[i + 0] == s)
            sum++;
        else if (p[i + 1] == s)
            sum++;
        else if (p[i + 2] == s)
            sum++;
        else if (p[i + 3] == s)
            sum++;
    }
    /*
    for (; i < SIZE; i++)
            if (p[i] == s)
               sum++;
    */
}

int main(int argc, char* argv[]) {
    if (argc < 2)
        return -1;

    int n = boost::lexical_cast<int>(argv[1]);
    session* p = new session[SIZE];

    boost::timer t;
    for (int i = 0; i < n; i++)
        foo1(p);
    std::cout << t.elapsed() << std::endl;

    t.restart();
    for (int i = 0; i < n; i++)
        foo2(p);
    std::cout << t.elapsed() << std::endl;

    t.restart();
    for (int i = 0; i < n; i++)
        foo3(p);
    std::cout << t.elapsed() << std::endl;

    delete [] p;
    return 0;
}

Tested 1000 times: ./a.out 1000

Output:

4.36
3.98
3.96

My machine:

CPU: Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz

Caches:

L1d cache: 32K

L1i cache: 32K

L2 cache: 256K

L3 cache: 15360K

In the test, foo2 and foo3 have equivalent performance. Is it because foo2 lets the CPU execute all the unrolled comparisons in parallel, and foo3 ends up doing the same? Is that right? If so, doesn't that violate the basic C/C++ else if semantics?

Can someone explain it? Thanks a lot.

Update

My compiler is gcc 4.4.6 on Red Hat:

g++ -Wall -O2 a.cpp

superK

3 Answers


In certain situations, I would expect foo3 to be faster, since it can short circuit (somewhere between 1 and 4 branches will occur, whereas in foo2, 4 branches always occur). In the situation where s is not equal to any of the 4 array elements (as is extremely likely in this case), foo2 and foo3 are basically the same code. In that case, 4 branches happen in both functions.

Consider what foo3 really looks like (in terms of branches):

if (p[i + 0] == s)
    sum++;
else
    if (p[i + 1] == s)
        sum++;
    else 
        if (p[i + 2] == s)
            sum++;
        else 
            if (p[i + 3] == s)
                sum++;

This should make it apparent that as long as the ifs keep coming up false, the sub-branches are going to happen. This means that in the situation where none of the ifs are true, it will execute the same number of operations as foo2 (though not the same functionality).

A crude way to think about it is that each if has a cost (not the body of the if, the actual if). In other words, each time an if is reached in the execution flow, a certain cost is paid, because a branch must be done. Thinking about it this way, it's easy to see that the cost of each function is the same when foo3's flow doesn't short circuit (when all 4 of foo3's ifs are encountered). (As KillianDS noted, if branch prediction is wrong, it will actually take longer for foo3, since the wrong branch will have to be rewound and the right one executed instead. It seems that for you, though, the correct branch is always being chosen.)

It's kind of like how the following snippets of code can have the same performance:

if (short_runtime()) {}

And:

if (short_runtime() && long_runtime()) {}

If short_runtime returns true, the one with the second function call is obviously going to take longer. If short_runtime() returns false, though, the long_runtime() call never happens, and thus the run times will be the same (or at least extremely similar).
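
To make that concrete, here is a small self-contained sketch (short_runtime and long_runtime are hypothetical stand-ins: a cheap check and a deliberately expensive one):

#include <iostream>

bool short_runtime() { return false; }   // cheap check, returns false here
bool long_runtime() {                    // deliberately expensive
    volatile long x = 0;
    for (long i = 0; i < 100000000L; i++)
        x += i;
    return x != 0;
}

int main() {
    if (short_runtime()) {}                      // one cheap call
    if (short_runtime() && long_runtime()) {}    // long_runtime() is skipped
                                                 // thanks to short circuiting
    std::cout << "done" << std::endl;
}

Because short_runtime() returns false, both ifs end up costing roughly the same.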


To test out this theory, you can make it so that p[i + 0] == s is always true. Just value initialize the array (session* p = new session[SIZE]();) and compare against an all-zero session (session s = {};) locally, so that every element matches.
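
A minimal sketch of those two changes (everything else in the program from the question stays the same):

// in main(): value-initialize the array so every element is all zeros
session* p = new session[SIZE]();   // note the trailing ()

// in foo1/foo2/foo3: compare against an all-zero session,
// so p[i + 0] == s is now true on every iteration
session s = {};

With that change, foo3 should short circuit after the first comparison in each group of four, while foo2 still evaluates all four.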


There seems to be a bit of confusion about the purpose/result of loop unrolling. It's done so that fewer jumps have to happen. If n things have to be done, instead of n iterations (jumps) with 1 action per iteration, you can have n/k iterations (jumps) with k actions per iteration. When everything can fit in the cache, this can provide a speed boost (if it can't fit in the cache, it can actually kill performance!).

The instructions aren't happening simultaneously (if they were, sum would need a mutex around it which would be extremely expensive). They're simply happening in sets of 4 instead of sets of 1.
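
As a generic illustration (a sketch, not the OP's exact code), unrolling by a factor of 4 with a tail loop for the leftover elements looks like this; count_matches is a made-up name, while session and sum are the ones from the question:

// unroll by 4: roughly size/4 loop iterations (jumps) instead of size,
// with 4 comparisons done per iteration
void count_matches(const session* p, int size, const session& s) {
    int n = size - size % 4;
    int i;
    for (i = 0; i < n; i += 4) {
        if (p[i + 0] == s) sum++;
        if (p[i + 1] == s) sum++;
        if (p[i + 2] == s) sum++;
        if (p[i + 3] == s) sum++;
    }
    for (; i < size; i++)   // tail loop for the 0-3 leftover elements
        if (p[i] == s) sum++;
}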

Corbin
  • Yes. Your theory is right! I have tested it. It seems that we just save the long_runtime() call and the branch-miss time. – superK Aug 15 '13 at 10:09
  • My test with 100 iterations and {1, 2, 3, 4, 5}: 2.46 0.23 0.15 –  Aug 15 '13 at 10:18

This is branch prediction:

With your program I get these speeds (here foo3 is a bit slower; g++ 4.8):

7.57
0.63
0.99

Now what happens? You do not initialize your array of sessions. Since all members of session are PODs, they won't get default-initialized and will contain essentially garbage. Therefore the ifs in your code will very quickly converge to always predicting the not-taken branch. In this case foo3 and foo2 are quite similar: foo2 executes all ifs unconditionally, and foo3 ends up executing them all too because the not-taken path is what gets predicted. I don't really see why foo3 is still a bit slower; I'll have to look at the disassembly for that.

Now see what happens if I add the following default constructor:

session() : proto(1), sport(2), dport(3), sip(4), dip(5) {}

Of course I also had to change the session variables in the foos to session s;. Now my timings become:

9.7
1.5
0.75

Suddenly foo3 is a lot faster, simply because now the branches will mostly be (correctly) predicted as taken. For foo3 this means that only the first condition in each group is evaluated and the other three are skipped. foo2 still has to evaluate all branches, even when the prediction is good, which obviously makes it slower.
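
Put together, the change looks roughly like this (a sketch of the modification applied to the question's code; foo2 and foo3 change the same way):

struct session {
    // every array element is now constructed equal to the value the foos
    // compare against, so the 'taken' branch dominates
    session() : proto(1), sport(2), dport(3), sip(4), dip(5) {}

    bool operator==(const session& r) const;

    uint8_t proto;
    uint16_t sport;
    uint16_t dport;
    uint32_t sip;
    uint32_t dip;
};

void foo1(session* p) {
    session s;   // default-constructed, equal to every p[i]
    for (int i = 0; i < SIZE; i++)
        if (p[i] == s)
            sum++;
}

Note that session s = {1, 2, 3, 4, 5}; from the original foos no longer compiles once the struct has a user-declared constructor (it is no longer an aggregate), which is why those variables have to become session s;.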

KillianDS
  • foo2 and foo3 exhibit equal performance because there is only a difference of about 0.02 s in your output. You could conduct multiple tests of foo2 and foo3 with different array sizes, e.g. 10, 19, 20, 21, 50, 80, 99, etc. That would give you more output to determine whether their performance is still equivalent (see the sketch after this list).
  • In this problem you are trying to exploit the compiler by performing loop unrolling. The if versus else if choice might not mean a lot to the compiler, because it's not the part that the parallelism and optimization depend on, but it might still help.
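
A sketch of the first suggestion, assuming the session struct from the question is in scope and foo2/foo3 are changed to take the element count as a parameter (a hypothetical foo2(session* p, int size) signature):

#include <iostream>
#include <boost/timer.hpp>
#include <boost/lexical_cast.hpp>

// run: ./a.out <iterations> <num_elements>
int main(int argc, char* argv[]) {
    if (argc < 3)
        return -1;

    int n    = boost::lexical_cast<int>(argv[1]);  // repetitions
    int size = boost::lexical_cast<int>(argv[2]);  // e.g. 10, 19, 20, 21, 50, 80, 99
    session* p = new session[size];

    boost::timer t;
    for (int i = 0; i < n; i++)
        foo2(p, size);                             // hypothetical size-taking variant
    std::cout << "foo2: " << t.elapsed() << std::endl;

    t.restart();
    for (int i = 0; i < n; i++)
        foo3(p, size);
    std::cout << "foo3: " << t.elapsed() << std::endl;

    delete [] p;
    return 0;
}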
Juniar