
I'm writing a C++ program that solves PDEs and algebraic equations on networks. The Eigen library shoulders the biggest part of the work by solving many sparse linear systems with LU decomposition.

As performance is always nice, I played around with the compiler options for that. I'm using

g++ -O3 -DNDEBUG -flto -fno-fat-lto-objects -std=c++17

as performance-related compiler options. I then added the `-march=native` option and found that execution time increased on average by approximately 6% (measured with GNU time, with a sample size of about 10 runs per configuration; there was almost no variance for either setting).

What are possible (or preferably likely) reasons for such an observation?

I guess the output of lscpu might be useful here, so here it is:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           78
Model name:                      Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
Stepping:                        3
CPU MHz:                         800.026
CPU max MHz:                     3100.0000
CPU min MHz:                     400.0000
BogoMIPS:                        5199.98
Virtualization:                  VT-x
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        512 KiB
L3 cache:                        4 MiB

Edit: As requested here are the cpu flags:

vendor_id   : GenuineIntel
cpu family  : 6
model name  : Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz

flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
vmx flags   : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple pml
bogomips    : 5199.98
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
Eike
  • Sounds unlikely, as this would be an extremely severe bug on gcc side. – SergeyA Mar 04 '21 at 21:15
  • Many optimizations are heuristics and sometimes they get it wrong. There's not really any way to guess without more specifics. If you can profile to identify a small section of code that has slowed down, and compare the assembly with and without `-march=native`, someone might be able to see where the compiler's assumptions were mistaken (or simply buggy). – Nate Eldredge Mar 04 '21 at 21:18
  • You've ruled out things like system load or CPU clock throttling changing between your tests, I suppose? – Nate Eldredge Mar 04 '21 at 21:19
  • -O2 is often faster than -O3, sometimes less is more – Apriori Mar 04 '21 at 21:24
  • @NateEldredge: What I have done is run both versions repeatedly on an otherwise not very busy computer. After the first 2 or so runs, the execution time stabilized for each version, and from then on I noted down the values. I'm not sure about the 6%, but I'm very sure of "slower with `-march=native`". Maybe tomorrow I'll find the time for more rigorous testing. – Eike Mar 04 '21 at 21:30
  • @Apriori I'll try that out, thanks for the hint. Although I'd still like to find out, what happens with this configuration. – Eike Mar 04 '21 at 21:32
  • @NateEldredge Do you have more insights on these heuristics? Most of the code I have written uses virtual functions, so is probably not the best from a performance standpoint. Not sure whether that is relevant. – Eike Mar 04 '21 at 21:35
  • Can you post the output of `cat /proc/cpuinfo`? Only 1 cpu is sufficient, in particular the `flags`. A _very_ long shot could be mixing of [SSE and AVX](https://software.intel.com/content/www/us/en/develop/articles/avoiding-avx-sse-transition-penalties.html) – sbabbi Mar 04 '21 at 21:38
  • There are a huge number. Based on cache sizes, when should functions be inlined? How much should loops be unrolled? When is vectorization desirable? Which combinations of certain instructions are better than others? When to trust branch prediction, and when to prefer branchless code? Etc, etc, etc. – Nate Eldredge Mar 04 '21 at 21:43
  • @sbabbi I have inserted it. – Eike Mar 04 '21 at 21:52
  • 1) maybe denormalized floating point plays a role, it's very specific on cpu, runtime settings and code/algorithms 2) as @NateEldredge said: really try to look at the assembly generated (either `gcc -S` when compiling, or `objdump -d` the binary), it's not hard and should be the first thing to check I think, because it takes out a lot of guesswork. I regularly use a simple `sed` script to strip the .S to bare assembly (that is: devoid of addresses) and then diff between gcc runs to see whether compiler flags and custom compiler patches perform as intended. – mvds Mar 04 '21 at 22:35
  • @mvds Can I ask you to provide a step-by-step list of commands for checking the assembly? I never did this before. Also: what is a good starting point? The whole program has a size of 600 KB (with `-O3` and `-flto`), and I guess it's too big to reason about the assembly (?). – Eike Mar 05 '21 at 07:12
  • 1
    @Eike ok, that’s quite big. But still: you should have a hunch already what code paths are most expensive in terms of runtime, those are the ones to check. With `objdump -d filename > filename.S` on the binary you generate an assembly file. In it you find practically each C function in its compiled form. Try to drop the -O flag; if the speed difference persists, diff the un-optimized S files generated with and without the flag you’re investigating. Look for the functions where your program spends most time, and see if there’s any obvious similarity or difference in generated instructions. – mvds Mar 05 '21 at 07:27
  • Also, regarding denormalized fp, check this: https://stackoverflow.com/q/9314534/371793 – mvds Mar 05 '21 at 07:29

1 Answer


There are plenty of reasons why code can be slower with `-march=native`, although this is quite exceptional.

That being said, in your specific context, one possible scenario is the use of slower SIMD instructions, or, more precisely, different SIMD instructions that end up making the program slower. Indeed, with `-O3` GCC vectorizes loops using the SSE instruction set on x86 processors such as yours (for backward compatibility). With `-march=native`, GCC will likely vectorize loops using the more advanced and more recent AVX instruction set (supported by your processor, but not by many older x86 processors). While the use of AVX instructions should speed your program up, that is not always the case: in a few pathological situations they can make it slower (less efficient code generated by compiler heuristics, loops too small to benefit from AVX, instructions that are available in SSE but missing or slower in AVX, alignment issues, SSE/AVX transition penalties, energy/frequency impact, etc.).

My guess is that your program is memory-bound, and thus AVX instructions do not make it faster.

You can test this hypothesis by enabling AVX manually with `-mavx -mavx2` rather than `-march=native` and checking whether the performance issue is still there. I also advise you to benchmark your application carefully using a tool like perf.

Jérôme Richard