It's unusual that temp vars hurt optimization; usually they're optimized away, or they help the compiler do a load or calculation once instead of repeating it (common subexpression elimination).
Repeated access to `arr[i]` might actually load multiple times if the compiler can't prove that stores through other pointers haven't modified that array element. `float *__restrict arr` can help the compiler prove there's no aliasing, or `float ai = arr[i];` tells it to read once and keep using the same value, regardless of other stores.
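To make the aliasing point concrete, here's a minimal sketch (the function names and the arithmetic are just for illustration; `__restrict` is a common compiler extension in C++, spelled `restrict` in C):

```cpp
#include <cstddef>

// With __restrict the compiler may assume arr and out never alias,
// so arr[i] only needs to be loaded once per iteration even though
// the loop stores through out.
void scale_restrict(const float *__restrict arr, float *__restrict out,
                    std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = arr[i] * 2.0f + arr[i];
}

// Portable alternative: a temp var makes the single load explicit;
// later stores through out can't change ai.
void scale_tempvar(const float *arr, float *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float ai = arr[i];
        out[i] = ai * 2.0f + ai;
    }
}
```

Both versions compute the same results; the difference is only in what the compiler is allowed to assume about aliasing.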
Of course, if optimization is disabled, more statements typically run slower than fewer large expressions, with store/reload latency usually the main bottleneck. See How to optimize these loops (with compiler optimization disabled)? . But `-O0` (no optimization) is supposed to be slow. If you're compiling without at least `-O2`, preferably `-O3 -march=native -ffast-math -flto`, that's your problem.
> I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
>
> Is this assumption in fact the case?
"Sync point" isn't the right technical term for it, but ISO C++ rules for FP math do distinguish between optimization within one expression vs. across statements / expressions.
Contraction of `a * b + c` into `fma(a,b,c)` is only allowed within one expression, if at all. GCC defaults to `-ffp-contract=fast`, allowing it across expressions; clang defaults to `on` or `off` (depending on version), but supports `-ffp-contract=fast`. See How to use Fused Multiply-Add (FMA) instructions with SSE/AVX . If `fast` makes the code with temp vars run as fast as without, strict FP-contraction rules were the reason why it was slower with temp vars.
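A minimal sketch of the two shapes (function names are my own): under strict contraction rules only the single-expression version may become an FMA, while `-ffp-contract=fast` may fuse both:

```cpp
// Single expression: a*b + c may be contracted into one fma(a, b, c)
// under the default (strict ISO) contraction rules.
double one_expr(double a, double b, double c) {
    return a * b + c;
}

// Two statements: strict rules require the product to be rounded to
// double at the assignment to p, so fusing into an FMA is not allowed
// unless -ffp-contract=fast relaxes that.
double two_stmts(double a, double b, double c) {
    double p = a * b;
    return p + c;
}
```

For most inputs both return the same value; they can differ by one ulp when the fused version avoids the intermediate rounding.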
(On legacy x87 80-bit FP math, or other unusual machines with `FLT_EVAL_METHOD != 0`, FP math happens at higher precision, and rounding to `float` or `double` costs extra.) Strict ISO C++ semantics require rounding at expression boundaries, e.g. on assignments. GCC defaults to ignoring that (`-fno-float-store`), but `-std=c++11` or whatever (instead of `-std=gnu++11`) will enforce that extra rounding work (a store/reload which costs throughput and latency).
This isn't a problem for x86 with SSE2 for scalar math; computation happens at either `float` or `double` precision according to the type of the data, with instructions like `mulsd` (scalar double) or `mulss` (scalar single). So it implements `FLT_EVAL_METHOD == 0` instead of x87's `2`. Hopefully nobody in 2023 is building number-crunching code for 32-bit x87 and caring about performance, especially without mentioning that obscure build choice. I mention this mostly for completeness.