0

I executed the code from this famous topic Why is processing a sorted array faster than processing an unsorted array?

On my Mac OS Mojave:

//file test.cpp
#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {   // Primary loop
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock()-start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}

I compiled it with optimization flags:

g++ -std=c++11 test.cpp -O3 -march=native -o test.out

After running the versions with and without sorting, I found out that my execution times were very close, ~0.37 s. I've tried the following things:

  1. Used clang instead of g++. Same result.
  2. Inserted std::srand(std::time(nullptr)); before random number generator to have a randomized seed. Same result.

Apple LLVM version 10.0.1 (clang-1001.0.46.4)

Mikhail Genkin
  • 3,247
  • 4
  • 27
  • 47
  • Show us the assembly that was generated for this loop. I suspect there isn't actually a conditional jump in there but rather a conditional move, which is always fast, no matter the data. – Jonathan S. May 20 '22 at 15:39
  • 1
    Is it possible that the compiler has figured out that the behavior of your inner `for` loop doesn't actually depend on the value of `i` in the outer loop? – Nathan Pierson May 20 '22 at 15:41
  • @JonathanS. I generated assembly code, but it is a long sequence and I can't find a relevant part. – Mikhail Genkin May 20 '22 at 15:57
  • @NathanPierson I increased the array size by a factor of 10000 and only retained 1 iteration in the outer loop. In this case, both versions also output approximately equal results (0.13 seconds). – Mikhail Genkin May 20 '22 at 15:59
  • 2
    @JonathanS.: clang auto-vectorizes this loop, so yes there's no data-dependent branching. Modern GCC should do the same, or on MacOS the `g++` command might actually *be* a symlink to `clang++`. [Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?](https://stackoverflow.com/q/66521344) is linked from the bottom of the famous question this one cites. – Peter Cordes May 20 '22 at 16:05
  • 1
    @NathanPierson: You're suggesting that maybe the compiler hoisted the work and only traversed the array once? One pass over an array size of `32768`, then 100000 iterations of `sum += onepass` wouldn't explain the 0.37s total time. Computers aren't that slow. – Peter Cordes May 20 '22 at 16:09

0 Answers0