Why isn't this C++ member function optimized by the compiler with -O3?

Question

The norm member function in the C++ vector class declared below is marked as const and (as far as I can tell) does not contain any side effects.

template <unsigned int N>
struct vector {
  double v[N];

  double norm() const {
    double ret = 0;
    for (int i=0; i<N; ++i) {
      ret += v[i]*v[i];
    }
    return ret;
  }
};

double test(const vector<100>& x) {
  return x.norm() + x.norm();
}

If I call norm multiple times on a const instantiation of vector (see test function above) with the gcc compiler (version 5.4) and optimizations turned on (i.e. -O3) then the compiler inlines norm, but still calculates the result of norm multiple times, even though the result should not change. Why doesn't the compiler optimize out the second call to norm and only calculate this result once? This answer seems to indicate that the compiler should perform this optimization if the compiler determines that the norm function does not have any side-effects. Why doesn't this happen in this case?

Note that I'm determining what the compiler produces using the Compiler Explorer, and that the assembly output for gcc version 5.4 is given below. The clang compiler gives similar results. Note also that if I use gcc's compiler attributes to manually mark norm as a const function using __attribute__((const)), then the second call is optimized out, as I wanted, but my question is why gcc (and clang) do not do this automatically since the norm definition is available?

test(vector<100u>&):
        pxor    xmm2, xmm2
        lea     rdx, [rdi+800]
        mov     rax, rdi
.L2:
        movsd   xmm1, QWORD PTR [rax]
        add     rax, 8
        cmp     rdx, rax
        mulsd   xmm1, xmm1
        addsd   xmm2, xmm1
        jne     .L2
        pxor    xmm0, xmm0
.L3:
        movsd   xmm1, QWORD PTR [rdi]
        add     rdi, 8
        cmp     rdx, rdi
        mulsd   xmm1, xmm1
        addsd   xmm0, xmm1
        jne     .L3
        addsd   xmm0, xmm2
        ret

(const seems wrong, I think pure is better, but then the optimization doesn't happen) This is a question of order of optimizations. If the function is const, x+x is optimized to 2*x (single appearance of x) very early. Otherwise, if you mark the function as noinline, the second call is optimized away as redundant. But if inlining happens first, then it becomes much harder for the compiler to notice that both loops are computing the same thing (loop fusion looks like a potentially useful step in that direction). — Marc Glisse, Mar 06 '17 at 06:49
Does the same thing happen when passing the vector by value? Just because you have a const reference doesn't meant it isn't being modified elsewhere. A const reference is only a signal that the callee (not the caller) will not change the object being referred to. IMO. In the general case it would be wrong for GCC to optimize out the second call because you added __attribute__((const)) but it might have been able to do more complex analysis and determine that the second call could be elided in a limited test case. — Jarra McIntyre, Mar 06 '17 at 07:55
I guess that in a not properly mutexed environment another function might change the contents in the array in between the calls of `norm`. — JHBonarius, Mar 06 '17 at 08:28
More interesting to me is that the compiler is optimizing with SSE, but does the multiplication of doubles 1 at a time (mulsd) instead of 2 at a time (mulpd). More evidence that if you have time critical code, don't depend on the compiler to do "the right thing". — BitBank, Mar 06 '17 at 13:24

score 4 · Accepted Answer · edited May 23 '17 at 11:54

The compiler can calculate the result of norm and reuse it multiple times. E.g. with the -Os switch:

test(vector<100u> const&):
        xorps   xmm0, xmm0
        xor     eax, eax
.L2:
        movsd   xmm1, QWORD PTR [rdi+rax]
        add     rax, 8
        cmp     rax, 800
        mulsd   xmm1, xmm1
        addsd   xmm0, xmm1
        jne     .L2
        addsd   xmm0, xmm0
        ret

The missing optimization isn't due to not-associative-floating-point-math or some observable-behavior-issue.

In a not properly mutexed environment another function might change the contents in the array in between the calls of norm

It may happen, but it's not a concern for the compiler (e.g. https://stackoverflow.com/a/25472679/3235496).

Compiling the example with the -O2 -fdump-tree-all switch you can see that:

g++ correctly detects vector<N>::norm() as a pure function (output file .local-pure-const1);
inlining happens at early stages (output file .einline).

Also note that marking norm with __attribute__ ((noinline)) the compiler performs CSE:

test(vector<100u> const&):
    sub     rsp, 8
    call    vector<100u>::norm() const
    add     rsp, 8
    addsd   xmm0, xmm0
    ret

Marc Glisse is (probably) right.

A more advanced form of Common Subexpression Elimination is required to un-inline the recurrent expression.

Why isn't this C++ member function optimized by the compiler with -O3?

1 Answers1