Branch prediction does not improve the performance

Question

I executed the code from this famous topic Why is processing a sorted array faster than processing an unsorted array?

On my Mac OS Mojave:

//file test.cpp
#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {   // Primary loop
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock()-start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}

I compiled it with optimization flags:

g++ -std=c++11 test.cpp -O3 -march=native -o test.out

After running the versions with and without sorting, I found out that my execution times were very close, ~0.37 s. I've tried the following things:

Used clang instead of g++. Same result.
Inserted std::srand(std::time(nullptr)); before random number generator to have a randomized seed. Same result.

Apple LLVM version 10.0.1 (clang-1001.0.46.4)

Show us the assembly that was generated for this loop. I suspect there isn't actually a conditional jump in there but rather a conditional move, which is always fast, no matter the data. — Jonathan S., May 20 '22 at 15:39
Is it possible that the compiler has figured out that the behavior of your inner `for` loop doesn't actually depend on the value of `i` in the outer loop? — Nathan Pierson, May 20 '22 at 15:41
@JonathanS. I generated assembly code, but it is a long sequence and I can't find a relevant part. — Mikhail Genkin, May 20 '22 at 15:57
@NathanPierson I increased the array size by a factor of 10000 and only retained 1 iteration in the outer loop. In this case, both versions also output approximately equal results (0.13 seconds). — Mikhail Genkin, May 20 '22 at 15:59
@JonathanS.: clang auto-vectorizes this loop, so yes there's no data-dependent branching. Modern GCC should do the same, or on MacOS the `g++` command might actually *be* a symlink to `clang++`. [Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?](https://stackoverflow.com/q/66521344) is linked from the bottom of the famous question this one cites. — Peter Cordes, May 20 '22 at 16:05
@NathanPierson: You're suggesting that maybe the compiler hoisted the work and only traversed the array once? One pass over an array size of `32768`, then 100000 iterations of `sum += onepass` wouldn't explain the 0.37s total time. Computers aren't that slow. — Peter Cordes, May 20 '22 at 16:09

Branch prediction does not improve the performance

0 Answers0