This question asks about possible causes of the program's behaviour as a function of icc 2019's compilation flags, considering the two phenomena stated in the conclusions and the information provided in the notes below.
A program can run three types of simulations; let's name them `S1`, `S2`, and `S3`.
Compiled (and run) on Intel Xeon Gold 6126 nodes, the program behaves as follows. Times are expressed as `A ± B`, where `A` is the mean time, `B` is the standard deviation, and the units are microseconds.
When compiled with `-O3`:

- `S1`: 104.7612 ± 108.7875 (EDIT: 198.4268 ± 3.5362)
- `S2`: 3.8355 ± 1.3025 (EDIT: 3.7734 ± 0.1851)
- `S3`: 11.8315 ± 3.5765 (EDIT: 11.4969 ± 1.313)

When compiled with `-O3 -march=native`:

- `S1`: 102.0844 ± 105.1637 (EDIT: 193.8428 ± 3.0464)
- `S2`: 3.7368 ± 1.1518 (EDIT: 3.6966 ± 0.1821)
- `S3`: 12.6182 ± 3.2796 (EDIT: 12.2893 ± 0.2156)

When compiled with `-O3 -xCORE-AVX512`:

- `S1`: 101.4781 ± 104.0695 (EDIT: 192.977 ± 3.0254)
- `S2`: 3.722 ± 1.1538 (EDIT: 3.6816 ± 0.162)
- `S3`: 12.3629 ± 3.3131 (EDIT: 12.0307 ± 0.2232)
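For reference, this is a minimal sketch of how mean and standard deviation could be computed from repeated timings (it assumes C++ and `std::chrono`; `run_simulation_step`, the dummy workload, and `n_runs` are placeholders, not the real code, which cannot be posted):

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <ratio>
#include <vector>

// Stand-in for one S1/S2/S3 call; the real simulation code is not available.
static void run_simulation_step() {
    volatile double x = 0.0;
    for (int i = 0; i < 10000; ++i) x += i * 1e-6;  // dummy work
}

int main() {
    const int n_runs = 1000;            // assumed number of repetitions
    std::vector<double> samples_us;
    samples_us.reserve(n_runs);

    for (int i = 0; i < n_runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        run_simulation_step();
        auto t1 = std::chrono::steady_clock::now();
        samples_us.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }

    // Mean and standard deviation in microseconds (the A ± B reported above).
    double mean = 0.0;
    for (double s : samples_us) mean += s;
    mean /= samples_us.size();

    double var = 0.0;
    for (double s : samples_us) var += (s - mean) * (s - mean);
    var /= samples_us.size();

    std::printf("%f +/- %f us\n", mean, std::sqrt(var));
    return 0;
}
```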
Two conclusions:

- `-xCORE-AVX512` produces code that is more performant than `-march=native`.
- the program's simulation called `S3` DECREASES its performance (i.e. runs more slowly) when compiled with the architecture-specific flags.
Note1: the standard deviation is huge, but repeated tests always yield similar values for the mean, leaving the overall ranking unchanged.
Note2: the code runs on 24 processors, while the Xeon Gold 6126 has 12 physical cores, so hyper-threading is in use; however, the two threads on each core DO NOT share memory.
Note3: the functions of `S3` are "very sequential", i.e. they cannot be vectorized.
There is no MWE. Sorry, the code is huge and cannot be posted here.
EDIT: print-related outliers were to blame for the large deviations. The means changed slightly, but the trend and the hierarchy remain.