
The purpose of this question is to ask about possible causes of the program's behaviour as a function of icc 2019's compilation flags, considering the two observations and the information provided in the notes below.

A program can run three types of simulations; let's call them S1, S2 and S3.

Compiled (and run) on Intel Xeon Gold 6126 nodes, the program shows the following behaviour, expressed as

A ± B

where A is the mean time, B is the standard deviation, and the units are microseconds.

When compiled with -O3:

  • S1: 104.7612 ± 108.7875 EDIT: it's 198.4268 ± 3.5362
  • S2: 3.8355 ± 1.3025 EDIT: it's 3.7734 ± 0.1851
  • S3: 11.8315 ± 3.5765 EDIT: it's 11.4969 ± 1.313

When compiled with -O3 -march=native:

  • S1: 102.0844 ± 105.1637 EDIT: it's 193.8428 ± 3.0464
  • S2: 3.7368 ± 1.1518 EDIT: it's 3.6966 ± 0.1821
  • S3: 12.6182 ± 3.2796 EDIT: it's 12.2893 ± 0.2156

When compiled with -O3 -xCORE-AVX512:

  • S1: 101.4781 ± 104.0695 EDIT: it's 192.977 ± 3.0254
  • S2: 3.722 ± 1.1538 EDIT: it's 3.6816 ± 0.162
  • S3: 12.3629 ± 3.3131 EDIT: it's 12.0307 ± 0.2232

Two conclusions:

  1. -xCORE-AVX512 produces code that is more performant than -march=native
  2. the performance of the program's simulation S3 DECREASES when the code is compiled with architecture-specific flags.

Note1: the standard deviation is huge, but repeated tests always yield similar values for the mean, leaving the overall ranking unchanged.

Note2: the code runs with 24 processes and a Xeon Gold 6126 has 12 physical cores. Hyper-threading is enabled, but the two threads on each core DO NOT share memory.

Note3: the functions of S3 are "very sequential", i.e. they cannot be vectorized.

There is no MWE. Sorry, the code is huge and cannot be posted here.

EDIT: print-related outliers were to blame for the large deviation. The means changed slightly, but the trend and hierarchy remain.
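For illustration, a minimal sketch (hypothetical structure, not the actual code) of the kind of fix involved: if the periodic print happens inside the timed region, a few samples blow up and inflate the deviation, whereas timing only the compute part avoids it.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical measurement loop: only the compute part of each iteration is
// timed, so the occasional print (roughly every 5th iteration, as in the real
// program) can no longer land inside a measured interval and inflate the
// standard deviation.
int main() {
    using clock = std::chrono::steady_clock;
    std::vector<double> samples_us;

    for (int iter = 0; iter < 1000; ++iter) {
        const auto t0 = clock::now();
        // ... one simulation step (S1/S2/S3) would run here ...
        const auto t1 = clock::now();
        samples_us.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());

        if (iter % 5 == 0)  // I/O happens outside the timed region
            std::printf("iteration %d finished\n", iter);
    }
    // mean and standard deviation are then computed from samples_us
}
```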

Gaston
  • `code is huge and cannot be posted here` you need to create a [mcve], not posting the whole code – phuclv Jul 15 '22 at 12:36
  • @phuclv is it mandatory though? If so I'm sorry, I would proceed to delete the question. It's literally impossible to reproduce the behaviour of a 100k-line program in a toy model, at least not in reasonable time. – Gaston Jul 15 '22 at 12:44
  • 2
  • Tiny differences in tuning choices for code-gen might result in alignment differences that end up mattering more. Especially [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) on Skylake-family CPUs if `-march=native` or `-xCORE-AVX512` didn't enable a workaround option. – Peter Cordes Jul 15 '22 at 13:01
  • 4
  • Definitely profile the code. You might be able to isolate one or two functions that do exhibit a statistically significant difference, and post a question about those. – MSalters Jul 15 '22 at 13:27
  • 1
  • How do you estimate standard deviation? If repeated tests always land in the same range, maybe it is overestimated? – pqnet Jul 15 '22 at 13:29
  • Indeed @MSalters, the 80% consumer is a function of ~5k lines with approximately equal time consumption in all the subsections, but congratulations on your thinking for this and the topic of the normal distribution :-) – Gaston Jul 15 '22 at 15:00
  • @pqnet indeed it was the case, 1 of the N=24 processes was spending "some" time printing results every M ~ 5 iterations. Thanks! – Gaston Jul 15 '22 at 15:01
  • @PeterCordes Would that be consistent with the following result? The original version, without `-march=native` nor `-xCORE-AVX512`, worsens the performance by approx. the same amount if the `new double[N]` allocations are aligned to 64 bytes with Intel's overridden aligned `new` allocator (a minimal sketch of that change is shown after these comments)? Would you put that into an answer so that I can accept it? – Gaston Jul 15 '22 at 15:03
  • 2
  • I wasn't talking about data alignment *at all*. I was talking about code alignment. Whatever difference that change made might just have been a fluke that also resulted in a change in code alignment. – Peter Cordes Jul 15 '22 at 15:06
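For reference, a minimal C++17 sketch of the 64-byte-aligned allocation Gaston describes above; the names and sizes are hypothetical, and, as the last comment notes, this is data alignment, which is a separate issue from the code-alignment effect being discussed.

```cpp
#include <cstddef>
#include <new>

int main() {
    const std::size_t N = 1 << 20;  // hypothetical array size

    // Request a 64-byte-aligned buffer: the std::align_val_t placement
    // argument selects the over-aligned operator new[] overload (C++17),
    // so the array starts on a cache-line / AVX-512-vector boundary.
    double* data = new (std::align_val_t(64)) double[N];

    // ... the simulation would fill and process `data` here ...

    // The matching aligned operator delete[] must be used for this buffer.
    ::operator delete[](data, std::align_val_t(64));
}
```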

2 Answers


Your premise is wrong. The differences in all cases are a tiny fraction of one standard deviation. A statistically significant result is typically > 2 standard deviations.

Of course, if we see a duration of 104.7612 ± 108.7875, we know that the expected runtime cannot be normally distributed, since that would imply a >16% chance of finishing before it starts! (negative runtime). But there are other distributions with long tails where the standard deviation can be bigger than the mean. Without knowing the exact distribution, I'm not entirely sure if the "> 2 standard deviations" rule of thumb holds up, but <0.1 standard deviation difference is definitely not significant.
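As a quick check of that figure, using the originally posted S1 numbers and (for the sake of argument) a normal distribution with that mean and standard deviation:

$$P(T < 0) = \Phi\!\left(\frac{0 - 104.7612}{108.7875}\right) = \Phi(-0.963) \approx 0.168,$$

i.e. roughly a 17% chance of a negative runtime.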

[edit] With the new figures, S1 differs by about 0.3 sigma, S2 by about 0.05 sigma and S3 by about 1.2 sigma. Individually that's still inconclusive, and there's still not enough code to say whether there's a correlation. You certainly can't simply add them up, but even if you did, you still wouldn't hit 2 sigma.
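As an illustration of how such a sigma count is obtained (here for S3, comparing the -march=native and -xCORE-AVX512 means against one of the reported deviations):

$$\frac{\lvert 12.2893 - 12.0307\rvert}{0.2156} \approx 1.2\,\sigma$$

Dividing by the -xCORE-AVX512 deviation (0.2232) instead gives essentially the same figure.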

MSalters
  • E.g. a simple exponential distribution has equal mean and standard deviation (1/λ), which at first glance could explain the 104.7612 ± 108.7875. But that would have a maximum probability for t=0, which is rather remarkable for runtimes. – MSalters Jul 15 '22 at 13:34
  • An edit was made and the deviation has decreased considerably; nevertheless, the arguments about the normal distribution are insightful and remain valid for other cases. – Gaston Jul 15 '22 at 15:15

An acceptable possible explanation was outlined in the comments; it read:

Tiny differences in tuning choices for code-gen might result in alignment differences that end up mattering more. Especially How can I mitigate the impact of the Intel jcc erratum on gcc? on Skylake-family CPUs if -march=native or -xCORE-AVX512 didn't enable a workaround option.

Further reading:

32-byte aligned routine does not fit the uops cache

Intel JCC Erratum - should JCC really be treated separately?

Gaston