It's unusual that temp vars hurt optimization; usually they're optimized away, or they help the compiler do a load or calculation once instead of repeating it (common subexpression elimination).
Repeated access to `arr[i]` might actually load multiple times if the compiler can't prove that stores through other pointers haven't modified that array element. `float *__restrict arr` can help the compiler prove there's no aliasing, or `float ai = arr[i];` tells it to read once and keep using the same value, regardless of other stores.
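To make the aliasing point concrete, here's a minimal sketch (the function names and the arithmetic are just for illustration; `__restrict` is a common compiler extension in C++, spelled `restrict` in C):

```cpp
#include <cstddef>

// With __restrict the compiler may assume arr and out never alias,
// so arr[i] only needs to be loaded once per iteration even though
// the loop stores through out.
void scale_restrict(const float *__restrict arr, float *__restrict out,
                    std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = arr[i] * 2.0f + arr[i];
}

// Portable alternative: a temp var makes the single load explicit;
// later stores through out can't change ai.
void scale_tempvar(const float *arr, float *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float ai = arr[i];
        out[i] = ai * 2.0f + ai;
    }
}
```

Both versions compute the same results; the difference is only in what the compiler is allowed to assume about aliasing.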
Of course, if optimization is disabled, more statements typically run slower than fewer large expressions, with store/reload latency usually the main bottleneck. See How to optimize these loops (with compiler optimization disabled)? . But `-O0` (no optimization) is supposed to be slow. If you're compiling without at least `-O2`, preferably `-O3 -march=native -ffast-math -flto`, that's your problem.
> I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
>
> Is this assumption in fact the case?
"Sync point" isn't the right technical term for it, but ISO C++ rules for FP math do distinguish between optimization within one expression vs. across statements / expressions.
Contraction of `a * b + c` into `fma(a,b,c)` is only allowed within one expression, if at all. GCC defaults to `-ffp-contract=fast`, allowing it across expressions; clang defaults to `on` or `off` (depending on version), but supports `-ffp-contract=fast`. See How to use Fused Multiply-Add (FMA) instructions with SSE/AVX . If `fast` makes the code with temp vars run as fast as without, strict FP-contraction rules were the reason why it was slower with temp vars.
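A minimal sketch of the two shapes (function names are my own): under strict contraction rules only the single-expression version may become an FMA, while `-ffp-contract=fast` may fuse both:

```cpp
// Single expression: a*b + c may be contracted into one fma(a, b, c)
// under the default (strict ISO) contraction rules.
double one_expr(double a, double b, double c) {
    return a * b + c;
}

// Two statements: strict rules require the product to be rounded to
// double at the assignment to p, so fusing into an FMA is not allowed
// unless -ffp-contract=fast relaxes that.
double two_stmts(double a, double b, double c) {
    double p = a * b;
    return p + c;
}
```

For most inputs both return the same value; they can differ by one ulp when the fused version avoids the intermediate rounding.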
(On legacy x87 80-bit FP math, or other unusual machines with `FLT_EVAL_METHOD != 0`, FP math happens at higher precision, and rounding to `float` or `double` costs extra.) Strict ISO C++ semantics require rounding at expression boundaries, e.g. on assignments. GCC defaults to ignoring that (`-fno-float-store`), but `-std=c++11` or whatever (instead of `-std=gnu++11`) will enforce that extra rounding work (a store/reload which costs throughput and latency).
This isn't a problem for x86 with SSE2 for scalar math; computation happens at either `float` or `double` precision according to the type of the data, with instructions like `mulsd` (scalar double) or `mulss` (scalar single). So it implements `FLT_EVAL_METHOD == 0` instead of x87's `2`. Hopefully nobody in 2023 is building number-crunching code for 32-bit x87 and caring about performance, especially without mentioning that obscure build choice. I mention this mostly for completeness.