10

I was testing algorithms and run into this weird behavior, when std::accumulate is faster than a simple for cycle.

Looking at the generated assembler I'm not much wiser :-) It seems that the for cycle is optimized into MMX instructions, while accumulate expands into a loop.

This is the code. The behavior manifests with -O3 optimization level, gcc 4.7.1

#include <vector>                                                                                                                                                                                                                                                              
#include <chrono>                                                                                                                                                                                                                                                              
#include <iostream>                                                                                                                                                                                                                                                            
#include <random>                                                                                                                                                                                                                                                              
#include <algorithm>                                                                                                                                                                                                                                                           
using namespace std;                                                                                                                                                                                                                                                           

int main()                                                                                                                                                                                                                                                                     
{                                                                                                                                                                                                                                                                              
    const size_t vsize = 100*1000*1000;                                                                                                                                                                                                                                        

    vector<int> x;
    x.reserve(vsize);

    mt19937 rng;
    rng.seed(chrono::system_clock::to_time_t(chrono::system_clock::now()));

    uniform_int_distribution<uint32_t> dist(0,10);

    for (size_t i = 0; i < vsize; i++)
    {
        x.push_back(dist(rng));
    }

    long long tmp = 0;
    for (size_t i = 0; i < vsize; i++)
    {
        tmp += x[i];
    }

    cout << "dry run " << tmp << endl;

    auto start = chrono::high_resolution_clock::now();
    long long suma = accumulate(x.begin(),x.end(),0);
    auto end = chrono::high_resolution_clock::now();

    cout << "Accumulate runtime " << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << " - " << suma << endl;

    start = chrono::high_resolution_clock::now();
    suma = 0;
    for (size_t i = 0; i < vsize; i++)
    {
        suma += x[i];
    }
    end = chrono::high_resolution_clock::now();

    cout << "Manual sum runtime " << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << " - " << suma <<  endl;

    return 0;
}
Šimon Tóth
  • 35,456
  • 20
  • 106
  • 151
  • 1
    As much as I'd love to try to answer this. I can't because VS2010 doesn't have ``... :( – Mysticial Nov 06 '12 at 01:58
  • This is why everybody says to prefer the standard algorithms over rolling your own. – Mark Ransom Nov 06 '12 at 02:05
  • 1
    By "cycle", do you mean "loop"? I was reading that as processor cycle. But if I replace "cycle" with "loop", the question makes so much more sense. – Mysticial Nov 06 '12 at 02:06
  • @Mysticial cleaned it up – Šimon Tóth Nov 06 '12 at 02:13
  • Can you show us the disassembly? And what are the numbers you are getting? – Mysticial Nov 06 '12 at 02:14
  • @Mysticial https://gist.github.com/4022145 – Šimon Tóth Nov 06 '12 at 02:20
  • @Mysticial: If you have any insight as to why the compiler would be able to better optimize the manual loop than the `accumulate` algorithm I'd like to hear about them. – Matthieu M. Nov 06 '12 at 07:57
  • @MatthieuM. That's the opposite what's happening here. But I bet it isn't too hard to come up with an example loop that will beat whatever the compiler can generate with `accumulate()`. – Mysticial Nov 06 '12 at 08:01
  • @Mysticial: Actually, if you look at Blastfurnace's answer, you'll see that `accumulate` is slower *if also using `long long`*. – Matthieu M. Nov 06 '12 at 08:03
  • @MatthieuM. Oh ic... If you look at Retired Ninja's answer, it says that the manual loop gets unrolled while `accumulate()` doesn't. Go figure... micro-optimization FTW. – Mysticial Nov 06 '12 at 08:05
  • c.f. [Difference in performance: std::accumulate vs std::inner_product vs Loop](https://stackoverflow.com/questions/52167695/difference-in-performance-stdaccumulate-vs-stdinner-product-vs-loop) / [Why is std::accumulate so slow?](https://stackoverflow.com/questions/21685426/why-is-stdaccumulate-so-slow) – underscore_d Sep 24 '18 at 16:08

3 Answers3

10

When you pass the 0 to accumulate, you are making it accumulate using an int instead of a long long.

If you code your manual loop like this, it will be equivalent:

int sumb = 0;
for (size_t i = 0; i < vsize; i++)
{
    sumb += x[i];
}
suma = sumb;

or you can call accumulate like this:

long long suma = accumulate(x.begin(),x.end(),0LL);
Vaughn Cato
  • 63,448
  • 5
  • 82
  • 132
7

I have some different results using Visual Studio 2012

// original code
Accumulate runtime 93600 ms
Manual sum runtime 140400 ms

Note that the original std::accumulate code isn't equivalent to the for loop because the third parameter to std::accumulate is an int 0 value. It performs the summation using an int and only at the end stores the result in a long long. Changing the third parameter to 0LL forces the algorithm to use a long long accumulator and results in the following times.

// change std::accumulate initial value -> 0LL
Accumulate runtime 265200 ms
Manual sum runtime 140400 ms

Since the final result fits in an int I changed suma and std::accumulate back to using only int values. After this change the MSVC 2012 compiler was able to auto-vectorize the for loop and resulted in the following times.

// change suma from long long to int
Accumulate runtime 93600 ms
Manual sum runtime 46800 ms
Blastfurnace
  • 18,411
  • 56
  • 55
  • 70
3

After fixing the accumulate issue others noted I tested with both Visual Studio 2008 & 2010 and accumulate was indeed faster than the manual loop.

Looking at the disassembly I saw some additional iterator checking being done in the manual loop so I switched to just a raw array to eliminate it.

Here's what I ended up testing with:

#include <Windows.h>
#include <iostream>
#include <numeric>
#include <stdlib.h>

int main() 
{
    const size_t vsize = 100*1000*1000;                                                                                                                                                                                                                                        
    int* x = new int[vsize];

    for (size_t i = 0; i < vsize; i++) x[i] = rand() % 1000;

    LARGE_INTEGER start,stop;
    long long suma = 0, sumb = 0, timea = 0, timeb = 0;

    QueryPerformanceCounter( &start );
    suma = std::accumulate(x, x + vsize, 0LL);
    QueryPerformanceCounter( &stop );
    timea = stop.QuadPart - start.QuadPart;

    QueryPerformanceCounter( &start );
    for (size_t i = 0; i < vsize; ++i) sumb += x[i];
    QueryPerformanceCounter( &stop );
    timeb = stop.QuadPart - start.QuadPart;

    std::cout << "Accumulate: " << timea << " - " << suma << std::endl;
    std::cout << "      Loop: " << timeb << " - " << sumb << std::endl;

    delete [] x;
    return 0;
}

Accumulate: 633942 - 49678806711
      Loop: 292642 - 49678806711

Using this code, the manual loop easily beats accumulate. The big difference is the compiler unrolled the manual loop 4 times, otherwise the generated code is almost identical.

Retired Ninja
  • 4,785
  • 3
  • 25
  • 35