I have the code listed below:
for(auto i =0;i<k;++i)
printf("%d\n",(va1+va2-(va1-va2))[i]);
where va1
and va2
are two valarray<int>
objects, k
is the size of va1
and va2
. What I am expecting is the compiler optimizes the printf
line like:
printf("%d\n", 2*va2[i]);
But instead, both Intel compiler (13.1) and CLang (3.4) did not do such optimization. For example, the Intel compiler outputed the assembly code:
..B1.18: # Preds ..B1.19 ..B1.17
movl (%r14,%r15,4), %r9d #18.42
movl $.L_2__STRING.1, %edi #18.9
movl (%r12,%r15,4), %r8d #18.42
xorl %eax, %eax #18.9
lea (%r9,%r8), %esi #18.9
subl %r8d, %r9d #18.9
subl %r9d, %esi #18.9
..___tag_value_main.18: #18.9
call printf #18.9
..___tag_value_main.19: #
# LOE rbx r12 r13 r14 r15
..B1.19: # Preds ..B1.18
incq %r15 #17.25
cmpq %r13, %r15 #17.21
jl ..B1.18 # Prob 82% #17.21
where r13
stores the value of k
, r14
and r12
are the beginning of memory of va1
and va2
, respectively. r15
is the iterator variable i
. From the code, what it does is:
load va1[i]
load va2[i]
add va1[i], va2[i] ==> %esi
sub va1[i], va2[i] ==> %r9d
sub %esi, %r9d ==> esi
print %esi
Why it is not optimized (even with -O3) to
load va2[i]
add va2[i], va2[i] => %esi
print %esi
with a peephole optimization? Gcc 4.8.2 does the optimization in this case, but cannot handle -(va1[i]+va2[i])+(va1[i]-va2[i])
.
It looks like one possible reason is expression templates are used in the code shown previously. Now the question is, why the compiler stopped optimization just one step before perfection? How expression template prevented the step forward?
NOTE Well, the answer can always be "because the compiler is not designed to do that optimization". But as far as I learned from the dragon book, the compiler should do the optimization iteratively until it cannot do anything better.