
I'm trying to make a piece of code run faster. It is floating-point-intensive code -- taking as input:

  • parameters (constant, double, int)
  • array of input values (constant, double)

The output is:

  • array of values (double)
  • Jacobian matrix

Currently I'm using

g++-7 (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0

and the following command line:

g++-7 -S -fPIC -O3 -DNDEBUG -funroll-loops -march=native -ffast-math \
-I $BOOST_DIR tmp.cpp -std=c++17 \
-D__forceinline='__attribute__((always_inline))' \
-frecord-gcc-switches -Wno-attributes

From memory, the G++ compiler used to produce better code for this kind of input -- and it also spent much longer chewing on it. I've tried to play with various options, but only

--param max-gcse-memory=1

seems to have any effect; only whether the argument is present matters, while changes to the parameter's value are ignored.

My criterion for better code is the number of vmov/mov instructions in the output relative to vmul[sp]d instructions. Better code should contain fewer [v]mov instructions.

When using

--param max-gcse-memory=1

I'm getting 10766 [v]mov instructions compared to 11325 without this parameter. This compares to 1000 vmulpd and 1900 vmulsd -- those counts being more or less constant across both tries.

Again -- I don't mind the compile time. I would like to get better code, and from what I remember, in the past (around 2010) I got better code along with much longer compile times.

1 Answer


SIMD instructions often require aligned data. It sounds like GCC is generating a lot of code to protect against insufficiently aligned data.

If you can modify the code, it sounds like you would benefit from some use of the aligned attribute or, even better, OpenMP SIMD pragmas.

Depending on how your program is structured, LTO (-flto) could make a big difference, as can limiting function visibility (i.e., -fvisibility=hidden).

Basically, you want to give the optimizer as much room to work as possible so it can drop a lot of extra code to get things properly aligned for SIMD instructions.

You may also want to consider enabling more ISA extensions... AVX supports 256-bit vectors, which means you can do twice as much work per instruction, and there is a good chance your CPU supports it. If you're shipping executables to run on other computers, consider using the target_clones attribute for an easy way to generate code optimized for multiple ISA extensions.

nemequ
  • The OP is using `-march=native`, which enables everything your CPU supports, and sets `-mtune=skylake` if you're on a Skylake, or whatever. LTO is a good suggestion, though, and OpenMP can be helpful if you can't get gcc to auto-vectorize some important loops without it (e.g. not enough use of `__restrict`), or if it does a better job with OpenMP. – Peter Cordes Dec 28 '17 at 01:05
  • Ah, you're right; I missed the -march=native. OpenMP also lets you specify alignment, which is the context I was talking about here. Based on the instructions he's talking about, it sounds like GCC is vectorizing the code, it's just also doing a lot of prep work; letting the compiler know stuff is aligned and letting it optimize more aggressively (LTO + visibility) should go a long way in that case. – nemequ Dec 28 '17 at 02:02
  • Yes, gcc's vanilla auto-vectorization strategy is fully-unrolled scalar intros/outros to reach an alignment boundary for one of the pointers in the inner loop (but then no unrolling at all for the inner loop). With wide vectors (and especially with narrow integer types) this is a lot of code bloat. See https://stackoverflow.com/questions/38552116/how-to-remove-noise-from-gcc-clang-assembly-output for tips on looking at compiler output, TL:DR: put it up on http://gcc.godbolt.org/. – Peter Cordes Dec 28 '17 at 02:35