The important qualifier here is what "least number of instructions" means. If that is to be interpreted as making the CPU perform the fewest steps, and we further stipulate that no advanced techniques are to be employed, like SIMD, GPU programming or OpenMP (or other automatic parallelization technologies)... just plain C or C++, then consider:
Assuming something like:
int a[ 10 ];
which is filled with data at runtime, and will always contain 10 entries (indexes 0 through 9).
The std::accumulate algorithm does a nice job here, creating a tight loop in the assembler, no mess... just quick:
int r = std::accumulate( &a[ 0 ], &a[ 10 ], 0 );
Of course, some const int signifying the size of the array 'a' would be in order (and note that the second argument must be the one-past-the-end iterator, &a[ 10 ], not the last element).
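For illustration, a minimal complete version might look like this. It's only a sketch: kCount, fill_array and sum_it are hypothetical names, and the point is just the named size and the one-past-the-end iterator:
#include <numeric>

const int kCount = 10;                          // named size instead of a magic number
int a[ kCount ];                                // filled with data at runtime

extern void fill_array( int *data, int count ); // hypothetical stand-in for whatever fills 'a'

int sum_it()
{
    fill_array( a, kCount );
    // a + kCount is one past the last element (equivalent intent to &a[ kCount ])
    return std::accumulate( a, a + kCount, 0 );
}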
This compares curiously to:
for( int n=0; n < 10; ++n ) r += a[ n ];
The compiler very smartly emits the additions unrolled (one mov and nine add instructions in the listing below), and doesn't even bother with a loop.
Now, this means that in std::accumulate, though the loop is tight, there will be, at minimum, two add instructions for each element (one for the sum, and one to increment the iterator). Add to that the comparison instruction and a conditional jump, and there are at least 4 instructions per item, or about 40 machine-language steps of various cost in ticks.
On the other hand, the unrolled result of the for loop is just 10 machine steps, which the CPU can very likely schedule with great cache friendliness, and no jumps.
The for loop is definitely faster.
The compiler "knows" what you're trying to do, and gets to the job as well as you might think through it with the proposed code you posted.
Further, if the size of the array gets too outlandish for unrolling the loop, the compiler automatically performs the classic optimization that std::accumulate does not appear to get for some reason... i.e., performing two additions per loop iteration (when it does construct a loop because of the number of elements), as sketched below.
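In source form, that classic transformation looks roughly like this. This is a hand-written sketch of the shape VC emits (the name sum_unrolled2 is mine), not literal compiler output:
int sum_unrolled2( const int *p, int n )
{
    int s0 = 0, s1 = 0;
    int i = 0;
    // two independent accumulators per iteration halve the loop overhead
    // and let the two adds execute in parallel in the CPU
    for ( ; i + 1 < n; i += 2 )
    {
        s0 += p[ i ];
        s1 += p[ i + 1 ];
    }
    if ( i < n )            // odd element count: pick up the straggler
        s0 += p[ i ];
    return s0 + s1;
}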
Using VC 2012, this source:
int r = std::accumulate( &a[ 0 ], &a[ 10 ], 0 );
int z = 0;
int *ap = a;
int *ae = &a[9];
while( ap <= ae ) { z += *ap; ++ap; }
int z2 = 0;
for (int n=0; n < 10; ++n ) z2 += a[ n ];
produces the following assembler snippets on a release build:
int r = std::accumulate( &a[ 0 ], &a[ 10 ], 0 );
00301270 33 D2 xor edx,edx
00301272 B8 D4 40 30 00 mov eax,3040D4h
00301277 EB 07 jmp wmain+10h (0301280h)
00301279 8D A4 24 00 00 00 00 lea esp,[esp]
00301280 03 10 add edx,dword ptr [eax]
00301282 83 C0 04 add eax,4
00301285 3D F8 40 30 00 cmp eax,3040F8h
0030128A 75 F4 jne wmain+10h (0301280h)
while( ap <= ae ) { z += *ap; ++ap; }
003012A0 03 08 add ecx,dword ptr [eax]
003012A2 03 70 04 add esi,dword ptr [eax+4]
003012A5 83 C0 08 add eax,8
003012A8 3D F4 40 30 00 cmp eax,3040F4h
003012AD 7E F1 jle wmain+30h (03012A0h)
003012AF 3D F8 40 30 00 cmp eax,3040F8h
003012B4 77 02 ja wmain+48h (03012B8h)
003012B6 8B 38 mov edi,dword ptr [eax]
003012B8 8D 04 0E lea eax,[esi+ecx]
003012BB 03 F8 add edi,eax
for (int n=0; n < 10; ++n ) z2 += a[ n ];
003012BD A1 D4 40 30 00 mov eax,dword ptr ds:[003040D4h]
003012C2 03 05 F8 40 30 00 add eax,dword ptr ds:[3040F8h]
003012C8 03 05 D8 40 30 00 add eax,dword ptr ds:[3040D8h]
003012CE 03 05 DC 40 30 00 add eax,dword ptr ds:[3040DCh]
003012D4 03 05 E0 40 30 00 add eax,dword ptr ds:[3040E0h]
003012DA 03 05 E4 40 30 00 add eax,dword ptr ds:[3040E4h]
003012E0 03 05 E8 40 30 00 add eax,dword ptr ds:[3040E8h]
003012E6 03 05 EC 40 30 00 add eax,dword ptr ds:[3040ECh]
003012EC 03 05 F0 40 30 00 add eax,dword ptr ds:[3040F0h]
003012F2 03 05 F4 40 30 00 add eax,dword ptr ds:[3040F4h]
Based on comments, I decided to try this in Xcode 7, with drastically different results. This is its unroll of the for loop:
.loc 1 58 36 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
movq _a(%rip), %rax
Ltmp22:
##DEBUG_VALUE: do3:z2 <- EAX
movq %rax, %rcx
shrq $32, %rcx
.loc 1 58 33 is_stmt 0 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
addl %eax, %ecx
.loc 1 58 36 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
movq _a+8(%rip), %rax
Ltmp23:
.loc 1 58 33 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
movl %eax, %edx
addl %ecx, %edx
shrq $32, %rax
addl %edx, %eax
.loc 1 58 36 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
movq _a+16(%rip), %rcx
.loc 1 58 33 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
movl %ecx, %edx
addl %eax, %edx
shrq $32, %rcx
addl %edx, %ecx
.loc 1 58 36 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
movq _a+24(%rip), %rax
.loc 1 58 33 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
movl %eax, %edx
addl %ecx, %edx
shrq $32, %rax
addl %edx, %eax
.loc 1 58 36 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
movq _a+32(%rip), %rcx
.loc 1 58 33 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
movl %ecx, %edx
addl %eax, %edx
shrq $32, %rcx
addl %edx, %ecx
This may not look as clean as VC's simple list, but it may run as fast, because the setup (movq or movl) for each addition can run in parallel in the CPU while the previous entry is finishing its addition, costing little to nothing by comparison to VC's simple, clean-"looking" series of adds on memory sources.
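Expressed back in source form, clang's trick is roughly the following sketch (the name sum_by_pairs is mine, memcpy stands in for the 64-bit movq loads, and it assumes a little-endian target like x86):
#include <cstdint>
#include <cstring>

int sum_by_pairs( const int *p, int n )   // assumes n is even, as with the 10 here
{
    int total = 0;
    for ( int i = 0; i < n; i += 2 )
    {
        std::uint64_t pair;
        std::memcpy( &pair, p + i, sizeof pair );           // one 64-bit load (movq)
        total += static_cast<std::int32_t>( pair );         // low half:  p[ i ]
        total += static_cast<std::int32_t>( pair >> 32 );   // high half: p[ i + 1 ] (shrq $32)
    }
    return total;
}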
The following is Xcode's std::accumulate. It SEEMS there's an init phase required, but then it performs a clean series of additions, having unrolled the loop, which VC did not do.
.file 37 "/Applications/Xcode7.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1" "numeric"
.loc 37 75 27 is_stmt 1 ## /Applications/Xcode7.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/numeric:75:27
movq _a(%rip), %r14
Ltmp11:
movq %r14, -48(%rbp) ## 8-byte Spill
Ltmp12:
shrq $32, %r14
movq _a+8(%rip), %rbx
movq %rbx, -56(%rbp) ## 8-byte Spill
shrq $32, %rbx
movq _a+16(%rip), %r13
movq %r13, -72(%rbp) ## 8-byte Spill
shrq $32, %r13
movq _a+24(%rip), %r15
movq %r15, %r12
shrq $32, %r12
Ltmp13:
movl _a+32(%rip), %eax
Ltmp14:
movq -48(%rbp), %rax ## 8-byte Reload
addl %eax, %r14d
movq -56(%rbp), %rax ## 8-byte Reload
addl %eax, %r14d
addl %ebx, %r14d
movq -72(%rbp), %rax ## 8-byte Reload
addl %eax, %r14d
addl %r13d, %r14d
addl %r15d, %r14d
addl %r12d, %r14d
addl -64(%rbp), %r14d ## 4-byte Folded Reload
The bottom line here is that the optimizations we rely upon from compilers vary so widely and wildly from one compiler to another that we should rely upon them, but watch the output.
LLVM is quite exemplary, and understands std::accumulate better than VC, it seems, but this short investigation can't reveal whether that is a difference in the implementation of the library or of the compiler. There could be important differences in the implementation of Xcode's std::accumulate which give the compiler more insight than VC's version of the library.
That applies more generally to algorithms, even those from numeric. std::accumulate is a for loop. It is likely expanded inline as a for loop based on pointers into the array, which is why VC's choice to create a loop for std::accumulate was echoed in its choice to produce a loop for the code using an int * to loop through the array, while it unrolled the loop for the for loop using an integer to reference entries in the array by index. In other words, VC really did no better in a straight for loop when pointers were used, and that suggests it is VC's optimizer, not the library, at issue in this case.
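A typical library implementation bears this out; paraphrased (the name accumulate_sketch is mine, and real headers add naming and debug machinery), the algorithm is essentially nothing but this pointer-walking loop:
template <class InputIt, class T>
T accumulate_sketch( InputIt first, InputIt last, T init )
{
    // the whole algorithm: when InputIt is int*, this is a pointer-based loop
    for ( ; first != last; ++first )
        init = init + *first;
    return init;
}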
This follows Stroustrup's own favorite example of the idea of information available to the compiler, comparing qsort from C and sort from C++. qsort takes a function pointer to perform the comparison, cutting the compiler off from understanding the comparison and forcing it to call a function through a pointer. The C++ sort function, on the other hand, takes a functor, which conveys more information about the comparison. That could still result in a function call, but the optimizer has the opportunity to understand the comparison sufficiently to inline it.
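A minimal side-by-side of that example (the names cmp_int and sort_both_ways are mine; the comparison logic is identical, only its visibility to the optimizer differs):
#include <algorithm>
#include <cstdlib>
#include <functional>

// C: the comparison hides behind a function pointer the optimizer must call
int cmp_int( const void *l, const void *r )
{
    const int a = *static_cast<const int *>( l );
    const int b = *static_cast<const int *>( r );
    return ( a > b ) - ( a < b );   // avoids the overflow risk of a - b
}

void sort_both_ways( int *p, int n )
{
    std::qsort( p, n, sizeof( int ), cmp_int );   // indirect call per comparison

    // C++: the functor's operator< is visible, so each comparison can inline
    std::sort( p, p + n, std::less<int>() );
}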
In VC's case, for whatever reason (we'd have to ask Microsoft), the compiler is confused when looping through the array via pointers. The information given to it is different than with the loop using an integer to index the array. It understands the index, but not the pointers. LLVM, by contrast, understood both (and more). The difference in information is not important to LLVM, but it is to VC. Since std::accumulate is really an inline function representing a for loop, and that loop is processed via pointers, it escaped VC's recognition, just as the straight for loop based on pointers did. If a specialization could be made for integer arrays, such that accumulate looped with indexes rather than pointers, VC would respond with better output, but it shouldn't have to be so.
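Such a specialization might look like the following (purely hypothetical; no standard library provides it, and the point is that one shouldn't have to write it):
// hypothetical index-based accumulate for int arrays, feeding VC the
// integer-indexed pattern its optimizer recognizes
int accumulate_indexed( const int *p, int count, int init )
{
    for ( int n = 0; n < count; ++n )
        init += p[ n ];
    return init;
}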
A poor optimizer can miss the point, and a poor implementation of the library could confuse the optimizer, which means that under the best circumstances std::accumulate can perform about as well as the for loop for a simple array of integers, producing an unrolled version of the loop creating the sum, but not always. However, there's little to get in the way of the compiler's understanding in a for loop... everything is right there, and the implementation of the library can't mess it up; it's all up to the compiler at that point. For that, VC shows its weakness.
I tried all settings on VC to try to get it to unroll std::accumulate, but so far it never did (I haven't tried newer versions of VC).
It didn't take much to get Xcode to unroll the loop; LLVM seems to have deeper engineering. It may have a better implementation of the library, too.
Incidentally, the C code example I posted at the top was used in VC, which didn't recognize that the three different summations were related. LLVM in Xcode did, which meant that the first time I tried it there, it simply adopted the answer from std::accumulate and otherwise did nothing. VC was really weak on that point. In order to get Xcode to perform 3 separate tests, I randomized the array before each call... otherwise Xcode realized what I was doing where VC did not.
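The reshuffle between runs amounted to something like this (a sketch, assuming std::shuffle; the name scramble and the exact randomization are mine, and all that matters is that the compiler can no longer prove the three sums are the same):
#include <algorithm>
#include <random>

// scramble 'a' between tests so the optimizer can't reuse one summation's
// result for the next
void scramble( int *p, int n )
{
    static std::mt19937 rng( std::random_device{}() );
    std::shuffle( p, p + n, rng );
}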