
I was reading an interesting post about *Why is processing a sorted array faster than processing an unsorted array?* and saw a comment made by @mp31415 that said:

Just for the record. On Windows / VS2017 / i7-6700K 4GHz there is NO difference between two versions. It takes 0.6s for both cases. If number of iterations in the external loop is increased 10 times the execution time increases 10 times too to 6s in both cases

So I tried it on an online C/C++ compiler (with, I suppose, a modern server architecture). For the sorted and unsorted versions I get ~1.9s and ~1.85s respectively, not so different, but repeatable.

So I wonder: is it still true for modern architectures? The question is from 2012, not so long ago... Or where am I wrong?


Clarifications to the question, for reopening:

  • Please forget about me adding the C code as an example. That was a terrible mistake. Not only was the code erroneous, posting it misled people into focusing on the code itself rather than on the question.

  • When I first tried the C++ code used in the link above, I got only about a 2% difference (1.9s vs 1.85s).

  • My first question and intent was about the previous post, its C++ code, and the comment of @mp31415.

  • @rustyx made an interesting comment, and I wondered if it could explain what I observed.

    Interestingly, a debug build exhibits 400% difference between sorted/unsorted, and a release build at most 5% difference (i7-7700).

In other words, my question is:

  • Why didn't the C++ code in the previous post perform as well as the previous OP claimed?

made more precise by:

  • Could the timing difference between the release build and the debug build explain it?
Guillaume D
  • How random is the data in `data`? – François Andrieux Apr 11 '19 at 14:43
  • Not very random, I suspect, given the `GetMyRand()` function that is presented here. – Robert Harvey Apr 11 '19 at 14:45
  • I have my doubts that this is an architectural change, so your entire premise may be invalid. Have you tried running the code in the original question? – Robert Harvey Apr 11 '19 at 14:47
  • When I try this, `data` contains about 80 to 90 distinct values, most of them repeating consecutively over a thousand times. Large contiguous portions of your `data` are already sorted just by how it's generated. – François Andrieux Apr 11 '19 at 14:48
  • A single test case does not justify your assumption that somehow the code runs differently with modern architectures. `GetTimeStamp()` called quickly a few times in succession will return similar values each time, which means your `GetMyRand()` is not particularly random. Your results are therefore just as likely to be a result of your choice of data as anything else. – Peter Apr 11 '19 at 14:49
  • When I replace your `GetMyRand` with `rand()%256` I get a different result. http://quick-bench.com/AF6rYHLXqgrjL5e-iDeaXqpFdQY – tkausl Apr 11 '19 at 14:49
  • Even if GetMyRand() % 0xff.... returned a good random unsigned int value, comparing it '< 128' would be false in nearly all cases. Different with % 256 ... (see the sketch after these comments) – Ingo Leonhardt Apr 11 '19 at 14:52
  • Apparently whatever the optimizer is doing really makes a difference. Your observation is borne out [here](http://quick-bench.com/mxcNDmyGyVpf-OWx_txj5aMpl2Y), but if I turn off optimizations, I [get the expected results](http://quick-bench.com/cjPETRaw4YtO8e6_86moufAQa4Y). – NathanOliver Apr 11 '19 at 14:55
  • @FrançoisAndrieux I did not use the OP's example. The code is taken from the canonical Q and uses `rand` – NathanOliver Apr 11 '19 at 14:57
  • @NathanOliver the optimizer just removes `sum` since it's not used. You have to send it through `benchmark::DoNotOptimize()`: http://quick-bench.com/rkDevVNsfZbudFHe8w5jJ4ZZeOQ – tkausl Apr 11 '19 at 14:58
  • @tkausl Thanks for that. I thought I was missing something. Looks like the OP's random generator is to blame then. – NathanOliver Apr 11 '19 at 14:59
  • It's not the RNG's fault; @IngoLeonhardt gave the correct answer: most, if not all, numbers will be >= 128. He's missing the %256. – tkausl Apr 11 '19 at 15:07
  • who is missing %256? – Guillaume D Apr 11 '19 at 15:12
  • You. You never %256 the number you get from your RNG. `% (unsigned)0xFFFFFFFFFFFF` is pointless. – tkausl Apr 11 '19 at 15:12
  • Anyway, even trying with %256 or with no % gives me the same results. – Guillaume D Apr 11 '19 at 15:16
  • This is what I get when I %256 the number: http://quick-bench.com/CmA_vP4l7gxLbWMaE08ABUnmBKc – tkausl Apr 11 '19 at 15:18
  • Ok, thank you. What about the C++ code given in the previous link? – Guillaume D Apr 11 '19 at 15:23
  • I'm closing this question (and cleaning up the comments) because you have made a drastic change to the question *multiple times* after receiving an answer, including removing all of the source code that was shown in the question and to which the answer refers. The question has become an unclear, amorphous blob that no one can possibly answer---nor *should* they answer, because it's likely going to change again, if history is any guide. You need to decide what you're *actually* asking, present that information, and then leave it alone. – Cody Gray - on strike Apr 11 '19 at 16:26
  • Yes, you can answer the question; @rustyx did. – Guillaume D Apr 11 '19 at 16:29
  • Yes. And then you changed the question out from underneath him. That's the problem. I am fully capable of answering a question about branch prediction and x86 architectures. In fact, I'm an expert on that particular subject. But I don't want to even try and answer this question, because you keep modifying the question as you go along. That doesn't work on Stack Overflow; this is a Q&A site, and the question needs to remain relatively stable once answers have been posted, otherwise the entire discussion becomes incoherent. – Cody Gray - on strike Apr 11 '19 at 16:35
  • The question I first asked (which is still in my post) is: ```For information, If I try it on a online c/c++ compiler (with, I suppose, modern server architecture), I get, for the sorted and unsorted, respectively, ~1.9s and ~1.85s, not so much different but repeatable. So why isn't it true anymore with modern architectures? ``` That is my question; rustyx gave some explanation, so I made my question more precise. I mean, why am I not allowed to edit my post in order to clarify what I'm trying to do? If I had had the answer from the beginning my question would have been clearer, but then I wouldn't have asked. – Guillaume D Apr 11 '19 at 16:38
  • And please answer as I'm curious to read from you. – Guillaume D Apr 11 '19 at 16:43
  • I would recommend asking a new question. As you've noted, rustyx has answered your original question relating to the C code. That served to clarify your thinking, and prompted a *follow-up* question, which should naturally be asked as a new question. This isn't just semantics. There's a very real practical problem here. It gets confusing when questions completely change after answers have been received. rustyx's answer is dealing with code that is *no longer visible in the question*. If I, or anyone else, answer now, we'll be saying something *completely different* than rustyx is. – Cody Gray - on strike Apr 12 '19 at 17:20
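
To make the point from those comments concrete, here is a minimal sketch of the two data-generation variants being compared, using `rand()` as a stand-in for the OP's `GetMyRand()` (which is no longer shown in the question); the helper names are purely illustrative:

#include <cstddef>
#include <cstdlib>

const size_t arraySize = 32768;
int data[arraySize];

void fill_full_range()
{
    for (size_t i = 0; i < arraySize; ++i)
        data[i] = rand();           // wide-range values: data[i] >= 128 is true almost
                                    // every time, so the branch is easy to predict
                                    // whether the array is sorted or not
}

void fill_mod_256()
{
    for (size_t i = 0; i < arraySize; ++i)
        data[i] = rand() % 256;     // values in [0, 255]: about half fall below 128,
                                    // so on unsorted data the branch is essentially
                                    // a coin flip
}

With the first variant there is (almost) nothing for the branch predictor to get wrong, which is consistent with seeing no difference between the sorted and unsorted runs.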

1 Answer


You're a victim of the as-if rule:

... conforming implementations are required to emulate (only) the observable behavior of the abstract machine ...

Consider the function under test ...

const size_t arraySize = 32768;
int *data;

long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}

Looking at the generated assembly (VS 2017, x86_64, /O2 mode), the machine does not execute your loops; instead it executes a similar program that does this:

long long test()
{
    long long sum = 0;
    // Primary loop
    for (size_t c = 0; c < arraySize; ++c)
    {
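        // note: the 100000 additions of data[c] in the original are reproduced here
        // as 20000 iterations of data[c] * 5; the branch condition below no longer
        // changes between iterations of the inner loop, so it is trivially predictable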
        for (size_t i = 0; i < 20000; ++i)
        {
            if (data[c] >= 128)
                sum += data[c] * 5;
        }
    }
    return sum;
}

Observe how the optimizer reversed the order of the loops and defeated your benchmark.

Obviously the latter version is much more branch-predictor-friendly.
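
A minimal sketch of why that defeats the benchmark (an illustration, not literal compiler output): once the loops are interchanged, `data[c]` is invariant in the inner loop, so the branch goes the same way 20000 times in a row and, in the limit, the whole inner loop collapses into a single, easily predicted test per element:

long long test_collapsed()   // hypothetical name, for illustration only
{
    long long sum = 0;
    for (size_t c = 0; c < arraySize; ++c)
    {
        if (data[c] >= 128)            // one branch per element instead of 100000
            sum += 100000LL * data[c]; // equivalent to 20000 additions of data[c] * 5
    }
    return sum;
}

Sorted or not, this does a negligible amount of hard-to-predict branching, so the sorted/unsorted gap vanishes.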

You can in turn defeat the loop hoisting optimization by introducing a dependency in the outer loop:

long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        sum += data[sum % 15];  // <== dependency!
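        // the index depends on the sum accumulated in the previous outer iteration,
        // so the compiler can no longer interchange or collapse the outer loop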
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}

Now this version again exhibits a massive difference between sorted/unsorted data. On my system (i7-7700) 1.6s vs 11s (or 700%).
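
A related practical note: the quick-bench runs in the comments face the same as-if issue, because if the computed sum is never used the optimizer may discard the work entirely. With Google Benchmark (which quick-bench uses), the usual remedy, as @tkausl pointed out, is to feed the result through `benchmark::DoNotOptimize`. A minimal sketch, assuming Google Benchmark is available and reusing `test()` and `data` from above (with `data` allocated and filled beforehand):

#include <benchmark/benchmark.h>

static void BM_test(benchmark::State& state)
{
    for (auto _ : state)
    {
        long long sum = test();          // the function under test shown above
        benchmark::DoNotOptimize(sum);   // make the result observable so the
                                         // as-if rule cannot delete the loops
    }
}
BENCHMARK(BM_test);
BENCHMARK_MAIN();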

Conclusion: the branch predictor matters more than ever these days, when we are facing unprecedented pipeline depths and instruction-level parallelism.

rustyx