
Since C++17, the standard library has provided parallel algorithms, so I tried the following code, which sums a list of numbers, to see if there are any performance gains.

#include <algorithm>
#include <chrono>
#include <execution>
#include <numeric>
#include <iostream>
#include <vector>

int main() {
  size_t n = 100000000;
  std::vector<size_t> vec(n);
  std::iota(vec.begin(), vec.end(), 0);

  auto par_sum = [&](size_t k) {
    auto t1 = std::chrono::high_resolution_clock::now();

    // rez initially holds the worker ids 0..k-1; each worker then overwrites
    // its own slot with the partial sum of its batch.
    std::vector<size_t> rez(k);
    std::iota(rez.begin(), rez.end(), 0);
    size_t batch = static_cast<size_t>(n / k) + 1;  // split [0, n) into k batches
    std::for_each(std::execution::par_unseq, rez.begin(), rez.end(),
      [&](size_t id) {
        size_t cum = 0;
        for (size_t i = id*batch; i < std::min((id+1)*batch, n); ++i) {
          cum += vec[i];
        }
        rez[id] = cum;
    });

    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
    std::cout << "n_worker = " << k
      << ", time = " << duration
      << ", rez = " << std::accumulate(rez.begin(), rez.end(), 0lu)
      << std::endl;
  };

  par_sum(1);
  par_sum(3);
  par_sum(5);
}

Compiled with:

g++  -std=c++17 -L/usr/local/lib -O3 -mavx -ltbb a.cpp

The results show:

n_worker = 1, time = 51875, rez = 4999999950000000
n_worker = 3, time = 57616, rez = 4999999950000000
n_worker = 5, time = 63193, rez = 4999999950000000

Questions:

  • Why are there no performance gains compared with a single worker?
avocado
  • 100000000 elements of `size_t` is almost a gigabyte of data. You're almost certainly memory bound. – Mooing Duck Oct 12 '20 at 19:56
  • @MooingDuck my mac has 32GB, so I think it should be fine. – avocado Oct 12 '20 at 19:58
  • If you had less than a GB of memory, then you'd be _disk_ bound. You're memory bound because your CPU doesn't have 1GB of _L1 cache_. L1 cache is usually 256KB-1MB. L2 cache is 256KB to 8MB. L3 cache is usually 4MB to upwards of 50MB. Since you have more data than that, the CPU is spending all its time waiting for more data to come back from the RAM sticks. – Mooing Duck Oct 12 '20 at 19:59
  • Off-topic: IMO, lambdas like this `par_sum` are harmful. Can't you define a class with a method to make this easier to read and maintain? – Marek R Oct 12 '20 at 19:59
  • @MarekR You mean lambda par_sum is harmful to performance? If it's about readability, sure, advice accepted. – avocado Oct 12 '20 at 20:01
  • @avocado no, just to code readability and maintenance. For compiler there is no difference. – Marek R Oct 12 '20 at 20:11
  • My result: `n_worker = 1, time = 67489, rez = 887459712 n_worker = 3, time = 25965, rez = 887459712 n_worker = 5, time = 16598, rez = 887459712` – Ted Lyngmo Oct 12 '20 at 20:57
  • @TedLyngmo WHAT?? Same as my code? What OS or env? – avocado Oct 12 '20 at 21:05
  • Exactly the same. Only used the MSVC compiler. – Ted Lyngmo Oct 12 '20 at 21:05
  • gcc under WSL2 shows similar figures, but gcc on my linux-box doesn't show the same improvement. – Ted Lyngmo Oct 12 '20 at 21:08
  • @TedLyngmo, interesting then! So gcc is performing worse than MSVC? – avocado Oct 12 '20 at 21:10
  • Well, gcc in WSL2 on the same machine as I'm running MSVC performs really well too (even a little better than the MSVC version - but in the same ballpark). It's only on linux that it's pretty "bad" – Ted Lyngmo Oct 12 '20 at 21:12
  • @TedLyngmo I see, that's sad then. – avocado Oct 12 '20 at 21:13

1 Answer


I would posit that, for a small amount of work, a tight loop can run on a single CPU while staying entirely within the L1 cache. As soon as you increase the amount of parallelism on the same data, you begin to incur overhead from cache coherency and page faults.

It would probably be worth looking at ways of counting cache misses, such as those suggested here: Programmatically counting cache faults

and also see: What is a "cache-friendly" code?

plus other resources on "cache friendly code" and "data oriented design": https://www.youtube.com/watch?v=b5v9aElYU2I

and https://youtu.be/fHNmRkzxHWs

This is related to false sharing which is explained here: https://www.youtube.com/watch?v=dznxqe1Uk3E
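
To make the false-sharing point concrete, here is a minimal sketch of the padding idea: give each worker's accumulator its own cache line so that no two threads ever write to the same line. The `PaddedSum` type, the `padded_parallel_sum` function, and the hard-coded 64-byte line size are assumptions for illustration (libstdc++ may not define `std::hardware_destructive_interference_size`):

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// Each per-worker accumulator gets its own cache line, so writes from
// different threads never invalidate each other's lines.
struct alignas(64) PaddedSum {  // 64 bytes is a common cache-line size
  size_t value = 0;
};

size_t padded_parallel_sum(const std::vector<size_t>& vec, size_t k) {
  const size_t n = vec.size();
  const size_t batch = n / k + 1;

  std::vector<PaddedSum> partial(k);
  std::vector<size_t> ids(k);
  std::iota(ids.begin(), ids.end(), size_t{0});

  std::for_each(std::execution::par_unseq, ids.begin(), ids.end(),
    [&](size_t id) {
      size_t cum = 0;
      for (size_t i = id * batch; i < std::min((id + 1) * batch, n); ++i) {
        cum += vec[i];
      }
      partial[id].value = cum;  // each thread touches only its own line
    });

  size_t total = 0;
  for (const auto& p : partial) {
    total += p.value;
  }
  return total;
}

Whether this helps much in the question's code is debatable, since each worker writes `rez[id]` only once at the end of its loop; the pattern matters far more when threads update data on a shared line frequently.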

Den-Jason
  • Thanks for sharing the above links! Actually, the code in my post is meant to verify whether a parallel `for_each` could boost performance or not; if it doesn't work that way, then I would investigate further and see what tricks to use. But it looks like what you shared here isn't a simple solution. – avocado Oct 12 '20 at 20:05
  • You could possibly look at compartmentalising the problem such that each worker operates on completely separate data. You would still need to map-reduce though, and there will be some threshold below which it would be more efficient to simply use a tight one-CPU loop. – Den-Jason Oct 12 '20 at 20:08
  • Added another video link to the answer – Den-Jason Oct 12 '20 at 20:12
  • I don't think the problem is parallelism on the same data; I think the problem is just loading 1GB of data from RAM into the CPU. The problem is the sheer amount of data. – Mooing Duck Oct 12 '20 at 20:12
  • @MooingDuck So you mean, if the data is small, I could observe perf gains? – avocado Oct 12 '20 at 20:14
  • Or if you're doing more work on each item of data, yes. https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory implies that loading data from RAM takes approximately the same time as 60 to 100 addition operations. – Mooing Duck Oct 12 '20 at 20:18
  • Out of interest, have you tried using `std::execution::par` instead of `std::execution::par_unseq`? @MooingDuck that is an awesome link. – Den-Jason Oct 12 '20 at 20:20
  • @MooingDuck Could you please share an example of parallel for_each? I started doubting if it would work on my mac. – avocado Oct 12 '20 at 20:20
  • Tried `std::execution::par` just now, same time across 1/3/5 workers. @MooingDuck, even with much less data (1m numbers now). – avocado Oct 12 '20 at 20:22
  • Sure. With _any_ amount of numbers, one thread will always be the same or fastest, as long as you're only doing a single addition per item. If you were doing something CPU-bound, such as encryption or compression per item, then you'd get gains from more threads (see the sketch after these comments). – Mooing Duck Oct 12 '20 at 20:24
  • FYI, I have just tried this on Windows using MSVC-2019, results are: `n_worker = 1, time = 130653, rez = 887459712` `n_worker = 3, time = 99364, rez = 887459712` `n_worker = 5, time = 78471, rez = 887459712` – Den-Jason Oct 12 '20 at 20:43
  • @Den-Jason That's similar to my result (also with MSVC). I've often found MSVC's implementation to be a lot quicker than the gcc version although I think they both use TBB as a backend. Not sure why exactly. – Ted Lyngmo Oct 12 '20 at 21:05
  • @TedLyngmo One thing I want to flag is that, I use gnu-g++ on Mac OSX, would it be a problem? I have to use gnu-g++, since Mac clang++ doesn't support parallel for_each. – avocado Oct 12 '20 at 21:07
  • I think it's a mix of both the backend (TBB I guess) and the compiler. I get different results for gcc depending on whether I'm compiling in linux or in WSL2, for example. I do think you should pay attention to the cachelines and false sharing as Den-Jason mentions. I've had incredible improvements when making sure that no false sharing occurs. – Ted Lyngmo Oct 12 '20 at 21:11
  • @TedLyngmo, yes, should pay attention to false-sharing. But I don't think in my above code there is any false-sharing though. – avocado Oct 12 '20 at 21:12
  • `vec` is used by all threads so that could be a problem. I think that may destroy performance. Also, the entities in `rez` are too close. I'd use `alignas(std::hardware_destructive_interference_size)` to separate them. I don't think gcc has that constant so set it to 64. – Ted Lyngmo Oct 12 '20 at 21:15
  • @TedLyngmo Thanks for your advice, didn't know `hardware_destructive_interference_size` before. But also interestingly, with the same code, you could get perf gains on Windows, I just cannot on Mac. That's sad. – avocado Oct 12 '20 at 21:16
  • Yeah, it's a bit confusing. I've noticed the same on linux. :-( – Ted Lyngmo Oct 12 '20 at 21:17
  • Oh, sorry, I see that `rez` is local in the lambda, so that shouldn't be a problem. – Ted Lyngmo Oct 12 '20 at 21:19
  • @TedLyngmo also `vec` is just read-only in each worker, no updates at all, so each worker/core cache-line should be fine? – avocado Oct 12 '20 at 21:20
  • I'm not sure about that. Every read drags `vec` into the cacheline. I think you should keep all the values that are used by one thread really close (preferably within `std::hardware_constructive_interference_size`) and separate the different threads' datasets by at least `std::hardware_destructive_interference_size`. – Ted Lyngmo Oct 12 '20 at 21:26
  • Also bear in mind that other factors come into play when profiling code, such as how many other processes are running. If I run this code 10 times I will see 10 different sets of results; sometimes the more CPUs are faster, sometimes slower. Plus the subsequent runs tend to be faster too (as you'd expect). Currently looking at using this to monitor cache miss counts: https://github.com/opcm/pcm – Den-Jason Oct 12 '20 at 21:57
  • @TedLyngmo if I refactor the above code so that naked pointers are used to read the non-overlapping segments of `vec`, it is faster overall but still you may see more CPUs taking more time. I believe that's due to "external factors" I mention above. I haven't tried setting process priority to "realtime" yet ;) – Den-Jason Oct 12 '20 at 22:00
  • @Den-Jason Could be. It may be worth creating a custom allocator to create work packages that have no external pointers, have all data packaged in one memory blob, and do not interfere with other cachelines. – Ted Lyngmo Oct 13 '20 at 04:13
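
To illustrate Mooing Duck's point about memory-bound versus compute-bound work, here is a rough sketch that times both a plain parallel sum and a heavier per-element transform under each execution policy. The `heavy_work` function and the timing harness are invented purely for illustration, and actual numbers will vary with machine, compiler, and TBB backend:

#include <chrono>
#include <cmath>
#include <execution>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

// Any sufficiently expensive per-element function will do; this one just
// keeps the ALU busy so the run is no longer dominated by memory traffic.
double heavy_work(size_t x) {
  double v = static_cast<double>(x) * 1e-6;
  for (int i = 0; i < 50; ++i) v = std::sin(v) + std::cos(v);
  return v;
}

template <class Policy>
void run(const char* label, Policy policy, const std::vector<size_t>& vec) {
  using clock = std::chrono::steady_clock;

  // Memory-bound case: one addition per element.
  auto t1 = clock::now();
  size_t sum = std::reduce(policy, vec.begin(), vec.end(), size_t{0});
  auto t2 = clock::now();

  // Compute-bound case: substantial arithmetic per element.
  double heavy = std::transform_reduce(policy, vec.begin(), vec.end(), 0.0,
                                       std::plus<>{}, heavy_work);
  auto t3 = clock::now();

  auto ms = [](auto a, auto b) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
  };
  std::cout << label << "  sum: " << ms(t1, t2) << " ms (" << sum << ")"
            << "  heavy: " << ms(t2, t3) << " ms (" << heavy << ")\n";
}

int main() {
  std::vector<size_t> vec(10'000'000);
  std::iota(vec.begin(), vec.end(), size_t{0});

  run("seq      ", std::execution::seq, vec);
  run("par      ", std::execution::par, vec);
  run("par_unseq", std::execution::par_unseq, vec);
}

On a multi-core machine the `heavy` column should scale with the thread count under `par`/`par_unseq`, while the plain `sum` column typically shows little or no gain, which matches the behaviour discussed in this thread.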