
I have a program that essentially repeats some vector operations many times. When I tried to use parallel_for to run the same tasks in parallel, I observed a significant time increase per task. Each task reads from the same data and there is no synchronization going on. Here is the example code (it requires the Taskflow library, https://github.com/cpp-taskflow/cpp-taskflow):

#include <algorithm>
#include <array>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <vector>
#include <x86intrin.h>
#include "taskflow.hpp"

//#define USE_AVX_512 1
constexpr size_t Size = 5000;
struct alignas(64) Vec : public std::array<double, Size> {};

struct SimulationData
{
    Vec a_;
    Vec b_;
    Vec c_;

    SimulationData()
    {
        std::iota(a_.begin(), a_.end(), 10);
        std::iota(b_.begin(), b_.end(), 5);
        std::iota(c_.begin(), c_.end(), 0);
    }
};
struct SimulationTask
{
    const SimulationData& data_;
    double res_;
    double time_;
    explicit SimulationTask(const SimulationData& data)
    : data_(data), res_(0.0), time_(0.0)
    {}

    constexpr static int blockSize = 20000;
    void sample()
    {
        auto tbeg = std::chrono::steady_clock::now();
        Vec result;
        for(auto i=0; i < blockSize; ++i)
        {
            add(result.data(), data_.a_.data(), data_.b_.data(), Size);
            mul(result.data(), result.data(), data_.c_.data(), Size);
            res_ += *std::max_element(result.begin(), result.end());
        }
        auto tend = std::chrono::steady_clock::now();
        time_ = std::chrono::duration_cast<std::chrono::milliseconds>(tend-tbeg).count();
    }
    inline double getResults() const
    {
        return res_;
    }
    inline double getTime() const
    {
        return time_;
    }
    static void add( double* result, const double* a, const double* b, size_t size)
    {
        size_t i = 0;
        // AVX-512 loop
        #ifdef USE_AVX_512
        for( ; i < (size & ~0x7); i += 8)
        {
            const __m512d kA8   = _mm512_load_pd( &a[i] );
            const __m512d kB8   = _mm512_load_pd( &b[i] );

            const __m512d kRes = _mm512_add_pd( kA8, kB8 );
            _mm512_stream_pd( &result[i], kRes );
        }
        #endif
        // AVX loop
        for ( ; i < (size & ~0x3); i += 4 )
        {
            const __m256d kA4   = _mm256_load_pd( &a[i] );
            const __m256d kB4   = _mm256_load_pd( &b[i] );

            const __m256d kRes = _mm256_add_pd( kA4, kB4 );
            _mm256_stream_pd( &result[i], kRes );
        }

        // SSE2 loop
        for ( ; i < (size & ~0x1); i += 2 )
        {
            const __m128d kA2   = _mm_load_pd( &a[i] );
            const __m128d kB2   = _mm_load_pd( &b[i] );

            const __m128d kRes = _mm_add_pd( kA2, kB2 );
            _mm_stream_pd( &result[i], kRes );
        }

        // Serial loop
        for( ; i < size; i++ )
        {
            result[i] = a[i] + b[i];
        }
    }
    static void mul( double* result, const double* a, const double* b, size_t size)
    {
        size_t i = 0;
        // AVX-512 loop
        #ifdef USE_AVX_512
        for( ; i < (size & ~0x7); i += 8)
        {
            const __m512d kA8   = _mm512_load_pd( &a[i] );
            const __m512d kB8   = _mm512_load_pd( &b[i] );

            const __m512d kRes = _mm512_mul_pd( kA8, kB8 );
            _mm512_stream_pd( &result[i], kRes );
        }
        #endif
        // AVX loop
        for ( ; i < (size & ~0x3); i += 4 )
        {
            const __m256d kA4   = _mm256_load_pd( &a[i] );
            const __m256d kB4   = _mm256_load_pd( &b[i] );

            const __m256d kRes = _mm256_mul_pd( kA4, kB4 );
            _mm256_stream_pd( &result[i], kRes );
        }

        // SSE2 loop
        for ( ; i < (size & ~0x1); i += 2 )
        {
            const __m128d kA2   = _mm_load_pd( &a[i] );
            const __m128d kB2   = _mm_load_pd( &b[i] );

            const __m128d kRes = _mm_mul_pd( kA2, kB2 );
            _mm_stream_pd( &result[i], kRes );
        }

        // Serial loop
        for( ; i < size; i++ )
        {
            result[i] = a[i] * b[i];
        }
    }
};

int main(int argc, const char* argv[])
{
    int numOfThreads = 1;
    if ( argc > 1 )
        numOfThreads = atoi( argv[1] );

    try
    {
        SimulationData data;
        std::vector<SimulationTask> tasks;
        for (int i = 0; i < numOfThreads; ++i)
            tasks.emplace_back(data);

        tf::Taskflow tf;
        tf.parallel_for(tasks, [](auto &task) { task.sample(); });
        tf.wait_for_all();
        for (const auto &task : tasks)
        {
            std::cout << "Result: " << task.getResults() << ", Time: " << task.getTime() << std::endl;
        }
    }
    catch (const std::exception& ex)
    {
        std::cerr << ex.what() << std::endl;
    }
    return 0;
}

I compiled this code with g++-8.2 -std=c++17 -mavx -o timing -O3 timing.cpp -lpthread on a dual E5-2697 v2 (each CPU has 12 physical cores with hyperthreading, so there are 48 hardware threads available). When I increase the number of parallel tasks, the timings for each task increase quite a lot:

# ./timing 1
Result: 1.0011e+12, Time: 618

Using 12 tasks:

# ./timing 12
Result: 1.0011e+12, Time: 788
Result: 1.0011e+12, Time: 609
Result: 1.0011e+12, Time: 812
Result: 1.0011e+12, Time: 605
Result: 1.0011e+12, Time: 808
Result: 1.0011e+12, Time: 1050
Result: 1.0011e+12, Time: 817
Result: 1.0011e+12, Time: 830
Result: 1.0011e+12, Time: 597
Result: 1.0011e+12, Time: 573
Result: 1.0011e+12, Time: 586
Result: 1.0011e+12, Time: 583

Using 24 tasks:

# ./timing 24
Result: 1.0011e+12, Time: 762
Result: 1.0011e+12, Time: 1033
Result: 1.0011e+12, Time: 735
Result: 1.0011e+12, Time: 1051
Result: 1.0011e+12, Time: 1060
Result: 1.0011e+12, Time: 757
Result: 1.0011e+12, Time: 1075
Result: 1.0011e+12, Time: 758
Result: 1.0011e+12, Time: 745
Result: 1.0011e+12, Time: 1165
Result: 1.0011e+12, Time: 1032
Result: 1.0011e+12, Time: 1160
Result: 1.0011e+12, Time: 757
Result: 1.0011e+12, Time: 743
Result: 1.0011e+12, Time: 736
Result: 1.0011e+12, Time: 1028
Result: 1.0011e+12, Time: 1109
Result: 1.0011e+12, Time: 1018
Result: 1.0011e+12, Time: 1338
Result: 1.0011e+12, Time: 743
Result: 1.0011e+12, Time: 1061
Result: 1.0011e+12, Time: 1046
Result: 1.0011e+12, Time: 1341
Result: 1.0011e+12, Time: 761

Using 48 tasks:

# ./timing 48
Result: 1.0011e+12, Time: 1591
Result: 1.0011e+12, Time: 1776
Result: 1.0011e+12, Time: 1923
Result: 1.0011e+12, Time: 1876
Result: 1.0011e+12, Time: 2002
Result: 1.0011e+12, Time: 1649
Result: 1.0011e+12, Time: 1955
Result: 1.0011e+12, Time: 1728
Result: 1.0011e+12, Time: 1632
Result: 1.0011e+12, Time: 1418
Result: 1.0011e+12, Time: 1904
Result: 1.0011e+12, Time: 1847
Result: 1.0011e+12, Time: 1595
Result: 1.0011e+12, Time: 1910
Result: 1.0011e+12, Time: 1530
Result: 1.0011e+12, Time: 1824
Result: 1.0011e+12, Time: 1588
Result: 1.0011e+12, Time: 1656
Result: 1.0011e+12, Time: 1876
Result: 1.0011e+12, Time: 1683
Result: 1.0011e+12, Time: 1403
Result: 1.0011e+12, Time: 1730
Result: 1.0011e+12, Time: 1476
Result: 1.0011e+12, Time: 1938
Result: 1.0011e+12, Time: 1429
Result: 1.0011e+12, Time: 1888
Result: 1.0011e+12, Time: 1530
Result: 1.0011e+12, Time: 1754
Result: 1.0011e+12, Time: 1794
Result: 1.0011e+12, Time: 1935
Result: 1.0011e+12, Time: 1757
Result: 1.0011e+12, Time: 1572
Result: 1.0011e+12, Time: 1474
Result: 1.0011e+12, Time: 1609
Result: 1.0011e+12, Time: 1394
Result: 1.0011e+12, Time: 1655
Result: 1.0011e+12, Time: 1480
Result: 1.0011e+12, Time: 2061
Result: 1.0011e+12, Time: 2056
Result: 1.0011e+12, Time: 1598
Result: 1.0011e+12, Time: 1630
Result: 1.0011e+12, Time: 1623
Result: 1.0011e+12, Time: 2073
Result: 1.0011e+12, Time: 1395
Result: 1.0011e+12, Time: 1487
Result: 1.0011e+12, Time: 1854
Result: 1.0011e+12, Time: 1569
Result: 1.0011e+12, Time: 1530

Is something wrong with this code? Is vectorization a problem with parallel_for? Can I get better insight using perf or a similar tool?

Max
  • What are the numbers for 24 threads? Might be just poor performance of intel's HT. – Dan M. Sep 04 '18 at 11:53
  • By the way you should probably merge the add/mul/max steps and do all of them at once, save 2/3rd of the loads and almost all the stores - at least, if this is an actual task, and not just a synthetic load for testing. – harold Sep 04 '18 at 12:15
  • Is it your intention that the compiler throws away all but one of those vectorized loops? If you look at the produced assembly (https://godbolt.org/z/e6H40r, search for the `dummy` assignments to understand which code lines go where) you can see that all but the topmost vectorized loop are eliminated - the compiler knows that the results of all versions are identical so it only keeps the fastest one. – Max Langhof Sep 04 '18 at 12:45
  • Well, it is a simplified example. In the real task there are random numbers generated (each task has its own generator) so each loop produces a different result. But there are some adds and multiplications etc. on each vector and I could reproduce the timing differences with this simple example. – Max Sep 04 '18 at 13:04
  • @Max Again though, you are aware that the compiler throws out both the `Serial loop` code and the `SSE2 loop` code, right? It recognizes that those are less efficient variants than (and have results identical to) the `AVX loop`. – Max Langhof Sep 04 '18 at 14:10
  • Yes, because Size is a multiple of 4 and known at compile time. When I pick 5003 the serial loop is used, but the SSE2 loop is still removed. – Max Sep 04 '18 at 14:22
  • You'll start having issues if you want to initialize `data` with non-default values. You `emplace_back()`, and thus move away that variable several times in a row. – Michaël Roy Sep 05 '18 at 19:05
  • Would you mind posting this on the cpp-taskflow's issue tracker? Our development team will help you with this. – Jes Nov 19 '18 at 20:50

2 Answers


Hyperthreading exists because threads in real-world scenarios frequently have to wait for data from memory, leaving the physical core essentially idle while the data is in transit. Your example (and also the CPU itself, e.g. through prefetching) tries hard to avoid being memory bound, so once you saturate the hardware threads, any two hyperthreads on the same core compete for its execution ports. Note that there are only 3 integer vector ALUs available per core per cycle on your CPUs - the scheduler can probably keep all of them busy with the operations of a single thread.

With 1 thread or 12 threads you won't really run into this contention. With 24 threads you only avoid the problem if each thread is scheduled onto its own physical core, which probably doesn't happen (so you start seeing worse timings). With 48 threads you definitely hit it.
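
One way to test this without a profiler is to pin one task to each physical core and re-run the 24-task case. Below is a minimal sketch (not part of the question's code: pinToCpu and runPinned are hypothetical helpers, Linux/glibc only, and whether logical CPUs 0..23 really map to 24 distinct physical cores depends on how the kernel enumerates the two sockets, so check lscpu -e first). It sidesteps Taskflow's scheduler purely for the experiment:

#include <pthread.h>   // pthread_setaffinity_np (GNU extension)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>
#include <vector>

// Pin the calling thread to one logical CPU (Linux/glibc only).
void pinToCpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Run one plain std::thread per task, pinned to logical CPUs 0..N-1.
void runPinned(std::vector<SimulationTask>& tasks)
{
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < tasks.size(); ++i)
    {
        threads.emplace_back([&tasks, i]
        {
            pinToCpu(static_cast<int>(i)); // assumption: CPU i is a distinct physical core
            tasks[i].sample();
        });
    }
    for (auto& t : threads)
        t.join();
}

If the pinned 24-task run comes back close to the single-task 618 ms, hyperthread contention (rather than memory bandwidth) is the dominant cost.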

As harold mentioned, you might also be store bound (yet another resource that hyperthread pairs compete over).
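
If the real workload really is just the add/mul/max chain over the same inputs, merging the steps as suggested above keeps the intermediate values in registers and removes both streaming stores plus the reloads of result entirely. A rough sketch (addMulMax is a hypothetical helper, not part of the original code; it reuses only the question's AVX path, assumes the same 32-byte alignment, and additionally needs <limits>):

static double addMulMax( const double* a, const double* b, const double* c, size_t size )
{
    size_t i = 0;
    __m256d kMax4 = _mm256_set1_pd( -std::numeric_limits<double>::infinity() );
    // AVX loop: (a + b) * c, keeping the running maximum in a register.
    for ( ; i < (size & ~0x3); i += 4 )
    {
        const __m256d kA4 = _mm256_load_pd( &a[i] );
        const __m256d kB4 = _mm256_load_pd( &b[i] );
        const __m256d kC4 = _mm256_load_pd( &c[i] );
        kMax4 = _mm256_max_pd( kMax4, _mm256_mul_pd( _mm256_add_pd( kA4, kB4 ), kC4 ) );
    }
    // Horizontal max over the four lanes.
    alignas(32) double lanes[4];
    _mm256_store_pd( lanes, kMax4 );
    double res = *std::max_element( lanes, lanes + 4 );
    // Scalar tail.
    for ( ; i < size; ++i )
        res = std::max( res, (a[i] + b[i]) * c[i] );
    return res;
}

In sample() the loop body would then reduce to res_ += addMulMax(data_.a_.data(), data_.b_.data(), data_.c_.data(), Size);, which reads three shared, read-only vectors and writes nothing per iteration.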

Max Langhof

You would probably need Intel VTune to prove it, but I’m guessing that because the worker threads aren’t doing a lot of computational work between loads and stores, they are instead limited by the speed at which the CPU can load data from RAM. Hence the more threads you have, the more they compete for and starve each other of limited memory bandwidth. As the document Detecting Memory Bandwidth Saturation in Threaded Applications from Intel states:

As an increasing number of threads or processes share the limited resources of cache capacity and memory bandwidth, the scalability of a threaded application can become constrained. Memory-intensive threaded applications can suffer from memory bandwidth saturation as more threads are introduced. In such cases, the threaded application won’t scale as expected, and performance can be reduced. … The clear symptom of bandwidth saturation for any parallel application is non-scaling behavior.

Profiling with a tool like VTune is the only way to be certain of where the bottleneck is. VTune's specialty is that it analyses performance at the CPU hardware level; being an Intel tool, it has access to performance counters and insights that other tools may not, and can therefore reveal bottlenecks as the CPU sees them. For AMD CPUs the equivalent tool is CodeXL. Additional tools that may be of use include Performance Counter Monitor (from https://stackoverflow.com/a/4015983) and, if running Windows, Visual Studio's CPU profiler (from https://stackoverflow.com/a/3489965).

To analyse performance bottlenecks at an instruction level, Intel Architecture Code Analyzer may be of use. It’s a static analyser that performs theoretical analysis of throughput, latency and data dependencies for a given Intel architecture. However, the estimates exclude effects from memory, cache and so on. For more information see What is IACA and how do I use it?.
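
Since the question mentions perf: even without VTune, comparing hardware counters between the 1-task and the 24/48-task runs already indicates whether the extra time goes into cache and memory traffic. For example (exact event names vary by kernel version and CPU):

perf stat -e cycles,instructions,cache-references,cache-misses,LLC-loads,LLC-load-misses ./timing 24

If instructions per cycle drop sharply and the LLC miss rate climbs as the task count grows, that points at memory/store bandwidth saturation; perf record ./timing 24 followed by perf report then shows which loop the time is actually spent in.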

Gnat
  • I would be skeptical of data _loading_ being the bottleneck. This is the most cache- and prefetch-friendly task you could imagine, and there is basically no contention whatsoever on the loading side. I think there is some argument to be made about stores, but you're correct that detailed profiling is the only way to be sure. – Max Langhof Sep 04 '18 at 12:45
  • You may be right—as I said, I’m only guessing. But the processor in question has ~60GB/s bandwidth (https://ark.intel.com/products/75283/Intel-Xeon-Processor-E5-2697-v2-30M-Cache-2_70-GHz), which, if we’re in the ballpark of the tests at http://codearcana.com/posts/2013/05/18/achieving-maximum-memory-bandwidth.html, can be saturated by 6-7 threads per CPU. There are also similar questions with limited arithmetic: https://stackoverflow.com/q/25179738/478380, https://stackoverflow.com/a/18159503/478380. Profiling is the only way to be sure. – Gnat Sep 04 '18 at 21:11