
Consider this code:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <execution>
#include <functional>
#include <random>
#include <vector>
using namespace std;
using namespace std::chrono;

constexpr size_t NUM_OF_ELEMENTS = 30000000;

// execute lambda and print the execution time
void measure(function<void()> lambda)
{
    auto start = high_resolution_clock::now();
    lambda();
    auto end = high_resolution_clock::now();
    
    // cast explicitly: microseconds::rep is long long, and %ld is only 32 bits on Windows
    printf("%lld\n", static_cast<long long>(duration_cast<microseconds>(end - start).count()));
}

int main()
{
    random_device rd;
    mt19937_64 gen(rd());
    // range from INT_MIN to INT_MAX
    uniform_int_distribution<> distr(-2147483648, 2147483647);
    
    vector<int> original;
    original.reserve(NUM_OF_ELEMENTS);
    
    for(size_t i = 0; i < NUM_OF_ELEMENTS; i++)
        original.push_back(distr(gen));
    
    vector<int> the_copy(original.begin(), original.end());
    
    // sort with single thread
    measure([&]{ sort(original.begin(), original.end()); });
    
    // sort with execution::par
    measure([&]{ sort(execution::par, the_copy.begin(), the_copy.end()); });
    return 0;
}

The code can be summed up in a few points:

  • create random number generator
  • create vector of random integers
  • create a copy of that vector
  • sort original vector with one thread and measure the time of execution
  • sort the copy with std::execution::par and measure the time of execution
  • print the execution times

The `execution::par` version always takes longer, no matter what value `NUM_OF_ELEMENTS` has. I tried values from 100 000 to 30 000 000, incrementing by 100 000. The code above consistently produces results like this (values in microseconds):

9729406    // single thread
10834613   // execution::par

I compiled the code on Windows 10 with gcc using VS Code: g++ -std=c++17 -g ${workspaceFolder}/main.cpp -o ${workspaceFolder}/main.exe

For the C++ standard library I use the MinGW distribution, which can be found here. Program versions: GCC 10.1.0 + LLVM/Clang/LLD/LLDB 10.0.0 + MinGW-w64 7.0.0

My processor has 6 cores, and at the time of execution I wasn't running any major programs or background processes.

At first I thought it had something to do with the size of the vector, but 30 000 000 elements is surely enough; a single test already runs for about 10 seconds.

  • What is going on?
  • Is execution::par meant to be used like this?
  • Do I have to enable some compilation flags or some other trick to make it work as expected?
sanitizedUser
    You shouldn't make performance comparisons of non-optimized code. Your performance results might change drastically with `-O2`. – Jorge Bellon Jul 14 '20 at 17:09
  • What is the result if you swap lines of sort and par sort calls? – 273K Jul 14 '20 at 17:09
  • @JorgeBellon Sadly that's not the case. With `-O2` it gives better times for both: `2111474 and 2172487` but `execution::par` is still slower. – sanitizedUser Jul 14 '20 at 17:14
  • A link to https://quick-bench.com/ would be helpful – Mikhail Jul 14 '20 at 17:15
  • @S.M. If I swap the lines, the measured times are swapped. This means `execution::par` is still slower even if it runs first. With `-O2` it gives me `2179489 2118475`, where the first time is now the `execution::par` version. – sanitizedUser Jul 14 '20 at 17:16
  • Mingw isn’t a very good library, do you notice the same problem with Microsoft’s standard library, or even better native libstdc++ or libc++? – Daniel Jul 14 '20 at 17:26
  • @Mikhail I benchmarked a vector with 1 000 000 elements. Both the single- and multi-threaded versions have the same execution time there. You can look [here](https://quick-bench.com/q/qY7gcNC5OMv9hJcf52sg-DXcAb4). Although this was my first time benchmarking there, so I may have written it wrongly. – sanitizedUser Jul 14 '20 at 17:37
  • The execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution **may be parallelized**. The invocations of element access functions in parallel algorithms invoked with this policy (usually specified as std::execution::par) **are permitted** to execute in either the invoking thread or in a thread implicitly created by the library to support parallel algorithm execution. – 273K Jul 14 '20 at 17:50
  • I want to say "may" is not "must". I mean the par sort may be executed in different threads, but not necessarily in parallel. – 273K Jul 14 '20 at 17:53
  • Throwing more threads at a problem doesn't guarantee *faster* code. Especially the more threads you use, the more task switches and such the OS has to perform, which can actually *slow down* code if you are not careful. – Remy Lebeau Jul 14 '20 at 18:09
  • Multithreading isn't a magic bullet. I'd suggest a read of introduction to algorithms. Iirc it has a nice section on parallelization. – Taekahn Jul 14 '20 at 19:59
  • @Taekahn I agree. But this is basic sorting. This should be faster with multiple threads. – sanitizedUser Jul 14 '20 at 20:54
  • Basic sorting is the worst type of sorting to try and parallelize. – Taekahn Jul 14 '20 at 21:46
  • @Taekahn why?? Shouldn't having 15 different threads mean that if you had 15 processors, each processor only needs to do 1/15 of the work, so shouldn't it be 15 times faster?? – Yunfei Chen Jul 15 '20 at 00:07
  • @YunfeiChen In a perfect ideal scenario, sure. In reality there is already stuff going on in the system. Plus, you need to take into account the time to create the threads, partition the work, combine the results across all the threads, and destroy the threads. – Taekahn Jul 15 '20 at 01:21
  • Possibly relevant: https://stackoverflow.com/questions/22766191/what-is-false-sharing-how-to-reproduce-avoid-it – Stephen Newell Jul 15 '20 at 04:48
  • @StephenNewell False sharing might cause this, but `execution::par` is supplied by the standard library. Surely they know what they are doing and they will avoid it, right? – sanitizedUser Jul 15 '20 at 09:12
  • I would agree with @S.M. Seems like the library decides not to parallelize anything at all. Can you take a look at your CPU utilization? Do you see that multiple cores are actually used? – Mikhail Jul 15 '20 at 11:17
  • I tweaked your quickbench code a little to make sure the same data is used by both sorts and excluded the time needed for copying the vector: https://quick-bench.com/q/gl1LiLc27hpBiKsu6fFXprnql5k But the results are surprisingly consistent. Again, it seems to me that no parallelization actually happens. – Mikhail Jul 15 '20 at 11:20
  • @Mikhail When looking at the performance tab in the task manager while the program is executing only one of the logical processors shows a spike to 20% for a brief moment. So it looks like no parallelization happens. But then I really don't understand what is `execution::par` good for. – sanitizedUser Jul 15 '20 at 12:34
  • Maybe it is just not properly implemented in libstdc++. – Mikhail Jul 15 '20 at 12:52

1 Answer


Running `perf` on your code, it looks like it spends a tiny bit longer trying to partition the data.

This is just one example, but I ran it several times, and the parallel version consistently took longer to partition the data at multiple levels of the sort. Since it's recursive, it's hard to get an exact picture of how much extra overhead it ends up adding overall.

`sort1` is the non-parallel sort.
`sort2` is the parallel sort.

[perf profile screenshot: sort1 vs sort2, partitioning times]

That aside, the answer to the underlying question is that you need Intel Threading Building Blocks (TBB) installed in order for GCC to use anything other than the serial algorithms.
On Linux it can be installed quite simply with `sudo apt install libtbb-dev`, and then you link against it with `-ltbb`.
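For reference, the full sequence on a Debian-style system might look like this (the package name is from the answer; the compiler flags are assumed from the question's command line plus the `-O2` suggestion in the comments):

```shell
# Install TBB so libstdc++'s parallel algorithms get a real backend
sudo apt install libtbb-dev

# Rebuild with optimizations and link against TBB
g++ -std=c++17 -O2 main.cpp -o main -ltbb
./main
```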

Taekahn
  • Are you sure the multi-threaded sort actually used multiple threads? – Mikhail Jul 15 '20 at 11:17
  • @Mikhail No, all I can say for sure is that it is calling the parallel version, not that it actually runs in parallel. Regardless, it's the code posted in the question, and the answer, at least on my machine, as to why it takes longer to run is that it's taking longer to partition the data on every run I tested. – Taekahn Jul 15 '20 at 13:21
  • @Mikhail Digging through GCC's source, it looks like it will only select parallel execution if you have Thread Building Blocks installed. https://github.com/gcc-mirror/gcc/blob/f0d0be62db5ba030283fa8189211830d09dfb467/libstdc%2B%2B-v3/include/bits/c%2B%2Bconfig#L674 – Taekahn Jul 15 '20 at 13:54
  • @Taekahn I think the comment you just posted is the right answer to this question. It doesn't run in parallel because I don't have TBB installed. Btw, do you know of an easy way to install it? I have searched, and people are having problems with it: [example issue](https://github.com/oneapi-src/oneTBB/issues/232). – sanitizedUser Jul 15 '20 at 14:31
  • I'll add it to the answer. As to getting it installed, on recent Linux it's super simple: `sudo apt install libtbb-dev`, then you just link against it. On Windows, I can't really provide any guidance. – Taekahn Jul 15 '20 at 15:11