I wrote a simple program to measure the performance of a parallel algorithm. Here is the code:

#include <algorithm>
#include <execution>
#include <vector>
#include <numeric>
#include <iostream>
#include <chrono>

int main()
{
    std::vector<float> data(1000000, 0);
    std::iota(std::begin(data), std::end(data), 0);

    auto t1 = std::chrono::high_resolution_clock::now();

    for (auto& item : data) {
        item = item*item;
    }

    auto t2 = std::chrono::high_resolution_clock::now();

    /* Getting number of milliseconds as a double. */
    std::chrono::duration<double, std::milli> ms_double = t2 - t1;
    std::cout << "non-optimized version : " << ms_double.count() << " ms" << std::endl;

    std::iota(std::begin(data), std::end(data), 0);

    t1 = std::chrono::high_resolution_clock::now();

    std::for_each(std::execution::par, std::begin(data), std::end(data),
                  [](float& item) {
        item = item*item;
    });

    t2 = std::chrono::high_resolution_clock::now();

    ms_double = t2 - t1;
    std::cout << "parallel version : " << ms_double.count() << " ms" << std::endl;

    return 0;
}

To my surprise, I see no improvement at all, regardless of how much data is in the vector. What is wrong with the STL algorithms? The compiler is gcc-10.

Dmitry

1 Answer

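Not every standard library implements the parallel versions yet. In GCC's libstdc++, some algorithms were implemented a while ago, but they depend on Intel's TBB library. As far as I know, if TBB is not available when you compile, the calls fall back to sequential execution, which would explain identical timings.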
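A quick way to check the TBB dependency on a given machine is to ask the compiler whether the TBB headers are visible. The sketch below is only a diagnostic and rests on an assumption: that libstdc++ chooses its parallel backend from the presence of `<tbb/tbb.h>`. The helper name `tbb_header_status` is made up for this example.

```cpp
#include <string>

// Hypothetical helper: reports whether the TBB headers are visible to
// the compiler. Assumption: libstdc++ selects its parallel backend based
// on this and uses a serial backend when the headers are missing.
std::string tbb_header_status()
{
#if defined(__has_include)
#  if __has_include(<tbb/tbb.h>)
    return "TBB headers found";
#  else
    return "TBB headers not found";
#  endif
#else
    return "unknown";
#endif
}
```

Print the result from `main` with `std::cout << tbb_header_status() << '\n';`. If the headers are found, GCC builds that use the parallel algorithms typically also need `-ltbb` on the link line.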

The MSVC standard library is well ahead on this and gives you an actual parallel implementation, so you can try that one if you are on Windows.

See "Are C++17 Parallel Algorithms implemented already?" for more information.


P.S. Your loop can be vectorized, so you can also use std::execution::par_unseq.
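As a minimal sketch of that suggestion, the squaring loop from the question can be wrapped in a function that passes `std::execution::par_unseq` instead of `std::execution::par`. The function name `square_all` is made up for this example, and whether the call actually runs in parallel or vectorized still depends on the standard library and its backend.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// Hypothetical helper: squares every element in place. par_unseq permits
// the implementation to both parallelize and vectorize the loop body;
// it does not guarantee either.
void square_all(std::vector<float>& data)
{
    std::for_each(std::execution::par_unseq, data.begin(), data.end(),
                  [](float& item) { item = item * item; });
}
```

The lambda touches only its own element and takes no locks, which is what makes the unsequenced policy safe to use here. With GCC and TBB installed, remember to link with `-ltbb`.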

Acorn