TL;DR: make sure you have enough RAM and that your benchmark metrics are accurate. That being said, I am not able to reproduce such a difference on my machine (i.e. I get identical performance results).
On most platforms, your code allocates 30 GB in total (since sizeof(int)=4, each process/thread allocates its own vector, and the items are initialized by the vector). Thus, you should first ensure you have at least that much RAM. Otherwise, data may be written to a (much slower) storage device (e.g. SSD/HDD) due to memory swapping. Benchmarks are not really useful in such an extreme case (especially because the results will likely be unstable).
Assuming you have enough RAM, your application is mostly bound by page faults. Indeed, on most modern mainstream platforms, the operating system (OS) allocates virtual memory very quickly, but it does not map it to physical memory immediately. The mapping is typically done when a page is read/written for the first time (i.e. on a page fault) and is known to be slow. Moreover, for security reasons (e.g. not to leak data from other processes), most OSs zeroize each page when it is written for the first time, making page faults even slower. On some systems this may not scale well (although it should be fine on typical desktop machines running Windows/Linux/Mac). This part of the time is reported as system time.
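One way to see this effect directly on Linux is to count minor page faults around the first touch of a freshly allocated vector. The following is only a minimal sketch (the buffer size is arbitrary and this is not the code of the original benchmark), using getrusage to read the fault counter:

#include <sys/resource.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Return the number of minor page faults of the current process so far.
static long minor_faults()
{
    rusage usage{};
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_minflt;
}

int main()
{
    const std::size_t n = 256u * 1024u * 1024u;  // 1 GiB of int (arbitrary size)

    long before = minor_faults();
    std::vector<int> v(n);                       // allocation + zero-initialization: first touch
    long after_first = minor_faults();

    for (std::size_t i = 0; i < n; ++i)          // second pass: pages are already mapped
        v[i] = 1;
    long after_second = minor_faults();

    std::printf("faults during construction: %ld\n", after_first - before);
    std::printf("faults during second fill:  %ld\n", after_second - after_first);
    std::printf("v[0] = %d\n", v[0]);            // keep the fill from being optimized out
    return 0;
}

On such a run, nearly all the faults should show up during the construction (the first touch), while the second fill should cause almost none.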
The rest of the time is mainly the time required to fill the vector in RAM. This part barely scales on many platforms: generally, 2-3 cores are clearly enough to saturate the RAM bandwidth on desktop machines.
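A rough way to check this is to time a fill of an already-touched buffer with an increasing number of threads: the speed-up typically flattens quickly once the RAM bandwidth is saturated. Again, this is only a sketch with arbitrary sizes and thread counts, not the code of the original benchmark:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Fill the range [begin, end) of the buffer with a constant value.
static void fill_range(int* data, std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        data[i] = 42;
}

int main()
{
    const std::size_t n = 512u * 1024u * 1024u;  // 2 GiB of int (arbitrary size)
    std::vector<int> v(n);                       // pages are touched here, once

    for (unsigned threads : {1u, 2u, 3u, 4u, 6u})
    {
        auto start = std::chrono::steady_clock::now();

        std::vector<std::thread> workers;
        const std::size_t chunk = n / threads;
        for (unsigned t = 0; t < threads; ++t)
        {
            std::size_t begin = t * chunk;
            std::size_t end = (t + 1 == threads) ? n : begin + chunk;
            workers.emplace_back(fill_range, v.data(), begin, end);
        }
        for (auto& w : workers)
            w.join();

        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        std::printf("%u thread(s): %.3f s (%.1f GiB/s)\n",
                    threads, elapsed.count(),
                    (n * sizeof(int)) / elapsed.count() / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}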
That being said, on my machine, I am unable to reproduce the same outcome with 10 times less memory allocated (as I do not have 30 GB of RAM). The same applies with 4 times less memory. In fact, the MPI version is much slower on my Linux machine with an i5-9600KF. Note that the results are relatively stable and reproducible (regardless of the ordering and the number of runs made):
time ./partest_threads 6 > /dev/null
real 0m0,188s
user 0m0,204s
sys 0m0,859s
time mpirun -np 6 ./partest_mpi > /dev/null
real 0m0,567s
user 0m0,365s
sys 0m0,991s
The bad result of the MPI version comes from the slow initialization of the MPI runtime on my machine: a program doing nothing takes roughly 350 ms to start. This shows that the behavior is platform-dependent. At the very least, it shows that time should not be used to measure the performance of the two applications; one should use monotonic C++ clocks instead.
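For reference, here is the kind of measurement I mean for the MPI version (a sketch, not the exact code I used; the vector fill is just a placeholder for the measured work): the work is bracketed by MPI_Barrier calls so that all ranks start and stop together, and timed with a monotonic C++ clock, excluding the MPI initialization.

#include <chrono>
#include <cstdio>
#include <vector>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);                  // make all ranks start together
    auto start = std::chrono::steady_clock::now();

    // The work to measure (placeholder).
    std::vector<int> v(100000000, 1);

    MPI_Barrier(MPI_COMM_WORLD);                  // wait for the slowest rank
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    if (rank == 0)
        std::printf("Time: %f s\n", elapsed.count());

    MPI_Finalize();
    return 0;
}

The pthread version can be timed the same way, with the MPI calls replaced by a thread barrier (or removed if only the total fill time matters).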
Once the code has been fixed to use an accurate timing method (with C++ clocks and MPI barriers), I get very close performance results between the two implementations (10 runs, with sorted timings):
pthreads:
Time: 0.182812 s
Time: 0.186766 s
Time: 0.187641 s
Time: 0.18785 s
Time: 0.18797 s
Time: 0.188256 s
Time: 0.18879 s
Time: 0.189314 s
Time: 0.189438 s
Time: 0.189501 s
Median time: 0.188 s
mpirun:
Time: 0.185664 s
Time: 0.185946 s
Time: 0.187384 s
Time: 0.187696 s
Time: 0.188034 s
Time: 0.188178 s
Time: 0.188201 s
Time: 0.188396 s
Time: 0.188607 s
Time: 0.189208 s
Median time: 0.188 s
For a deeper analysis on Linux, you can use the perf tool. A kernel-side profiling shows that most of the time (60-80%) is spent in the kernel function clear_page_erms, which zeroizes pages during page faults (as described above), followed by __memset_avx2_erms, which fills the vector values. Other functions take only a tiny fraction of the overall run time. Here is an example with pthreads:
64,24% partest_threads [kernel.kallsyms] [k] clear_page_erms
18,80% partest_threads libc-2.31.so [.] __memset_avx2_erms
2,07% partest_threads [kernel.kallsyms] [k] prep_compound_page
0,86% :8444 [kernel.kallsyms] [k] clear_page_erms
0,82% :8443 [kernel.kallsyms] [k] clear_page_erms
0,74% :8445 [kernel.kallsyms] [k] clear_page_erms
0,73% :8446 [kernel.kallsyms] [k] clear_page_erms
0,70% :8442 [kernel.kallsyms] [k] clear_page_erms
0,69% :8441 [kernel.kallsyms] [k] clear_page_erms
0,68% partest_threads [kernel.kallsyms] [k] kernel_init_free_pages
0,66% partest_threads [kernel.kallsyms] [k] clear_subpage
0,62% partest_threads [kernel.kallsyms] [k] get_page_from_freelist
0,41% partest_threads [kernel.kallsyms] [k] __free_pages_ok
0,37% partest_threads [kernel.kallsyms] [k] _cond_resched
[...]
If there is any intrinsic performance overhead in one of the two implementations, perf should be able to report it. If you are running on Windows, you can use another profiling tool, like VTune for example.