
I wrote a simple test program to compare the performance of parallelizing over multiple processes with MPI versus over multiple threads with std::thread. The parallelized work is simply writing into a large array. What I'm seeing is that the multi-process MPI version outperforms the multithreaded one by quite a wide margin.

The test code is:

#ifdef USE_MPI
#include <mpi.h>
#else
#include <thread>
#endif
#include <cstdlib>  // atoi
#include <iostream>
#include <vector>

// Each worker allocates and fills its own ~4 GB vector; the index i is unused.
void dowork(int i){
    int n = 1000000000;
    std::vector<int> foo(n, -1);
}

int main(int argc, char *argv[]){
    int npar = 1;
#ifdef USE_MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npar);
#else
    npar = 8;
    if(argc > 1){
        npar = atoi(argv[1]);
    }
#endif
    std::cout << "npar = " << npar << std::endl;

    int i;

#ifdef USE_MPI
    MPI_Comm_rank(MPI_COMM_WORLD, &i);
    dowork(i);
    MPI_Finalize();
#else
    std::vector<std::thread> threads;
    for(i = 0; i < npar; ++i){
        threads.emplace_back([i](){
            dowork(i);
        });
    }
    for(i = 0; i < npar; ++i){
        threads[i].join();
    }
#endif
    return 0;
}

The Makefile is:

partest_mpi:
    mpic++ -O2 -DUSE_MPI  partest.cpp -o partest_mpi -lmpi
partest_threads:
    c++ -O2 partest.cpp -o partest_threads -lpthread

And the results of execution are:

$ time ./partest_threads 8
npar = 8

real    0m2.524s
user    0m4.691s
sys 0m9.330s

$ time mpirun -np 8 ./partest_mpi
npar = 8
npar = 8
npar = 8
npar = 8
npar = 8
npar = 8
npar = 8npar = 8


real    0m1.811s
user    0m4.817s
sys 0m9.011s

So the question is: why is this happening, and what can I do to the threaded code to make it perform better? I'm guessing this has something to do with memory bandwidth and cache utilization. I'm running this on an Intel i9-9820X 10-core CPU.

  • Make sure you disable frequency scaling before running your benchmarks. https://stackoverflow.com/a/9006802/412080 – Maxim Egorushkin Jun 24 '21 at 22:51
  • Is the OS mapping your threads all to the same core? Print out what core you are running on, using hwloc or a similar tool. Alternatively, use a pinning tool to prevent the OS from migrating your threads/processes. – Victor Eijkhout Jun 25 '21 at 16:11

1 Answer


TL;DR: make sure you have enough RAM and that your benchmark timings are accurate. That being said, I am not able to reproduce such a difference on my machine (i.e. I get identical performance results).

On most platforms, your code allocates about 30 GiB in total (sizeof(int) = 4, and each of the 8 processes/threads allocates and fills its own vector of 10^9 elements, i.e. 8 × 4 GB). Thus, you should first ensure you have at least that much RAM. Otherwise, data may be written to a (much slower) storage device (e.g. SSD/HDD) due to swapping. Benchmarks are not really useful in such an extreme case (especially because the results will likely be unstable).

Assuming you have enough RAM, your application is mostly bound by page faults. Indeed, on most modern mainstream platforms, the operating system (OS) allocates virtual memory very quickly but does not map it to physical memory immediately. The mapping happens when a page is read or written for the first time (i.e. on a page fault) and is known to be slow. Moreover, for security reasons (e.g. not to leak data from other processes), most OSes zero each page the first time it is handed out, making page faults even slower. On some systems this may not scale well (although it should be fine on typical desktop machines running Windows/Linux/macOS). This part of the time is reported as system time.
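
To see this effect in isolation, here is a small standalone sketch (an illustration, not the benchmark discussed here) that writes the same buffer twice: the first pass pays the page-fault and kernel zeroing cost, while the second pass only writes memory that is already mapped:

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main(){
    const std::size_t n = 1000000000;          // ~4 GB of int, as in the question
    auto t0 = std::chrono::steady_clock::now();
    std::vector<int> foo(n, -1);               // first touch: pages are faulted in and zeroed by the kernel
    auto t1 = std::chrono::steady_clock::now();
    std::fill(foo.begin(), foo.end(), 42);     // second pass: the pages are already mapped
    auto t2 = std::chrono::steady_clock::now();
    std::printf("first touch: %.3f s, second pass: %.3f s\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t2 - t1).count());
    return 0;
}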

The rest of the time is mostly the time needed to fill the vector in RAM. This part barely scales on many platforms: generally 2-3 cores are enough to saturate the RAM bandwidth on desktop machines.

That being said, on my machine I am unable to reproduce the same outcome with 10 times less memory allocated (as I do not have 30 GiB of RAM). The same applies with 4 times less memory. Actually, the MPI version is much slower on my Linux machine with an i7-9600KF. Note that the results are relatively stable and reproducible (regardless of the ordering and the number of runs):

time ./partest_threads 6 > /dev/null
real    0m0,188s
user    0m0,204s
sys 0m0,859s

time mpirun -np 6 ./partest_mpi > /dev/null
real    0m0,567s
user    0m0,365s
sys 0m0,991s

The bad result of the MPI version comes from the slow initialization of the MPI runtime on my machine: a program that does nothing else takes roughly 350 ms just to start. This shows that the behavior is platform-dependent. At the very least, it shows that time should not be used to measure the performance of the two applications. One should instead use monotonic C++ clocks (e.g. std::chrono::steady_clock).
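
For reference, a minimal sketch of such a timing harness could look like this (an illustration rather than the exact code used below: only the work itself is timed with std::chrono::steady_clock, and in the MPI build MPI_Barrier keeps the ranks synchronized so that the MPI startup cost is excluded):

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>
#ifdef USE_MPI
#include <mpi.h>
#else
#include <thread>
#endif

void dowork(int){
    std::vector<int> foo(1000000000, -1);
}

int main(int argc, char *argv[]){
    using clk = std::chrono::steady_clock;
#ifdef USE_MPI
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);               // start all ranks together, excluding MPI startup
    auto t0 = clk::now();
    dowork(rank);
    MPI_Barrier(MPI_COMM_WORLD);               // wait for the slowest rank
    auto t1 = clk::now();
    if(rank == 0)
        std::cout << "Time: " << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    MPI_Finalize();
#else
    int npar = (argc > 1) ? std::atoi(argv[1]) : 8;
    std::vector<std::thread> threads;
    auto t0 = clk::now();
    for(int i = 0; i < npar; ++i)
        threads.emplace_back([i]{ dowork(i); });
    for(auto &t : threads)
        t.join();
    auto t1 = clk::now();
    std::cout << "Time: " << std::chrono::duration<double>(t1 - t0).count() << " s\n";
#endif
    return 0;
}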

Once the code has been fixed to use an accurate timing method (with C++ clocks and MPI barriers), I get very close performance results between the two implementations (10 runs, with sorted timings):

pthreads:
Time: 0.182812 s
Time: 0.186766 s
Time: 0.187641 s
Time: 0.18785 s
Time: 0.18797 s
Time: 0.188256 s
Time: 0.18879 s
Time: 0.189314 s
Time: 0.189438 s
Time: 0.189501 s
Median time: 0.188 s

mpirun:
Time: 0.185664 s
Time: 0.185946 s
Time: 0.187384 s
Time: 0.187696 s
Time: 0.188034 s
Time: 0.188178 s
Time: 0.188201 s
Time: 0.188396 s
Time: 0.188607 s
Time: 0.189208 s
Median time: 0.188 s

For a deeper analysis on Linux, you can use the perf tool. A kernel-side profile shows that most of the time (60-80%) is spent in the kernel function clear_page_erms, which zeroes pages during page faults (as described before), followed by __memset_avx2_erms, which fills the vector values. The other functions take only a tiny fraction of the overall run time. Here is an example with pthreads:

  64,24%  partest_threads  [kernel.kallsyms]              [k] clear_page_erms
  18,80%  partest_threads  libc-2.31.so                   [.] __memset_avx2_erms
   2,07%  partest_threads  [kernel.kallsyms]              [k] prep_compound_page
   0,86%  :8444            [kernel.kallsyms]              [k] clear_page_erms
   0,82%  :8443            [kernel.kallsyms]              [k] clear_page_erms
   0,74%  :8445            [kernel.kallsyms]              [k] clear_page_erms
   0,73%  :8446            [kernel.kallsyms]              [k] clear_page_erms
   0,70%  :8442            [kernel.kallsyms]              [k] clear_page_erms
   0,69%  :8441            [kernel.kallsyms]              [k] clear_page_erms
   0,68%  partest_threads  [kernel.kallsyms]              [k] kernel_init_free_pages
   0,66%  partest_threads  [kernel.kallsyms]              [k] clear_subpage
   0,62%  partest_threads  [kernel.kallsyms]              [k] get_page_from_freelist
   0,41%  partest_threads  [kernel.kallsyms]              [k] __free_pages_ok
   0,37%  partest_threads  [kernel.kallsyms]              [k] _cond_resched
[...]  
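
For reference, a profile like the one above can be collected on Linux with something along these lines (kernel-side samples may require root or a permissive kernel.perf_event_paranoid setting):

perf record ./partest_threads 8
perf report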

If there is any intrinsic performance overhead in one of the two implementations, perf should be able to report it. If you are running on Windows, you can use another profiling tool such as VTune.

  • Plus 1 for investigating. – Maxim Egorushkin Jun 25 '21 at 03:09
  • Allocating large vectors is sub-optimal when `malloc` invokes `mmap`, because the `std::vector` constructor initializes all elements and that causes one page fault per page. It would be ideal to invoke `mmap` with the `MAP_POPULATE` flag to avoid these unnecessary page faults; however, there is no interface for `std::vector` to propagate such a hint to the OS. `mmap` zeroes out the returned pages to avoid information leakage, and that could be used as an optimization to spare `std::vector` from having to zero the elements again, unless a non-zero default bit pattern is supplied. – Maxim Egorushkin Jun 25 '21 at 03:35
  • I agree. However, the last time I used `MAP_POPULATE` to speed up page faults, it was not much faster. I think one of the main issues is the small page size on most systems. Huge pages may help significantly in that case (although Linux/Mac are supposed to use transparent huge pages nowadays, IDK about Windows). For the `std::vector`, one could use a custom allocator that calls `mmap` internally with specific flags, as sketched below. It would be interesting to see whether it could be much faster or not. – Jérôme Richard Jun 25 '21 at 11:23
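
For illustration, a rough sketch of that custom-allocator idea could look like the following on Linux (note that MAP_POPULATE is Linux-specific, the name MmapPopulateAllocator is made up for this example, and its actual performance has not been measured here):

#include <sys/mman.h>
#include <cstddef>
#include <new>
#include <vector>

// Minimal C++11-style allocator that asks the kernel to pre-fault the pages
// (MAP_POPULATE), so the vector's initialization pass does not trigger page faults.
template <class T>
struct MmapPopulateAllocator {
    using value_type = T;

    MmapPopulateAllocator() = default;
    template <class U>
    MmapPopulateAllocator(const MmapPopulateAllocator<U>&) {}

    T* allocate(std::size_t n){
        void *p = mmap(nullptr, n * sizeof(T), PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if(p == MAP_FAILED) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T *p, std::size_t n){
        munmap(p, n * sizeof(T));
    }
};

template <class T, class U>
bool operator==(const MmapPopulateAllocator<T>&, const MmapPopulateAllocator<U>&){ return true; }
template <class T, class U>
bool operator!=(const MmapPopulateAllocator<T>&, const MmapPopulateAllocator<U>&){ return false; }

// Possible usage inside dowork():
//     std::vector<int, MmapPopulateAllocator<int>> foo(1000000000, -1);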