
I'm trying to sum array elements using a simple for loop, std::accumulate, and a manually unrolled for loop. As I expected, the manually unrolled loop is the fastest, but more interestingly, std::accumulate is much slower than the simple loop. This is my code; I compiled it with gcc 4.7 using the -O3 flag. Visual Studio will need a different rdtsc implementation.

#include <iostream>
#include <algorithm>
#include <numeric>
#include <stdint.h>


using namespace std;

// Read the x86 time-stamp counter: RDTSC returns a 64-bit cycle count split
// across EDX:EAX, recombined below.
__inline__ uint64_t rdtsc() {
  uint64_t a, d;
  __asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
  return (d<<32) | a;
}

class mytimer
{
 public:
  mytimer() { _start_time = rdtsc(); }
  void   restart() { _start_time = rdtsc(); }
  uint64_t elapsed() const
  { return  rdtsc() - _start_time; }

 private:
  uint64_t _start_time;
}; // timer

int main()
{
    const int num_samples = 1000;
    float* samples = new float[num_samples];
    mytimer timer;
    for (int i = 0; i < num_samples; i++) {
        samples[i] = 1.f;
    }
    double result = timer.elapsed();
    std::cout << "rewrite of " << (num_samples*sizeof(float)/(1024*1024)) << " Mb takes " << result << std::endl;

    timer.restart();
    float sum = 0;
    for (int i = 0; i < num_samples; i++) {
        sum += samples[i];
    }
    result = timer.elapsed();
    std::cout << "naive:\t\t" << result << ", sum = " << sum << std::endl;

    timer.restart();
    float* end = samples + num_samples;
    sum = 0;
    for(float* i = samples; i < end; i++) {
        sum += *i;
    }
    result = timer.elapsed();
    std::cout << "pointers:\t\t" << result << ", sum = " << sum << std::endl;

    timer.restart();
    sum = 0;
    sum = std::accumulate(samples, end, 0);
    result = timer.elapsed();
    std::cout << "algorithm:\t" << result << ", sum = " << sum << std::endl;

    // With ILP
    timer.restart();
    float sum0 = 0, sum1 = 0;
    sum = 0;
    for (int i = 0; i < num_samples; i+=2) {
        sum0 += samples[i];
        sum1 += samples[i+1];
    }
    sum = sum0 + sum1;
    result = timer.elapsed();
    std::cout << "ILP:\t\t" << result << ", sum = " << sum << std::endl;
}
– Evgeny Lazin
  • Have you tried `-ffast-math`? The compiler is not allowed to reorder floating-point operations without it. And efficient accumulation requires reordering the sums. – Mysticial Feb 10 '14 at 18:49
  • No, I haven't. But this option will affect all three loops equally, I guess. – Evgeny Lazin Feb 10 '14 at 18:51
  • "Much" slower? How much slower? 1%? 10% 100? – Gabe Feb 10 '14 at 18:56
  • @MikeSeymour My guess is that `std::accumulate` determines the type it accumulates from the argument type (as the standard requires), which results in converting each `float` to an `int`. – James Kanze Feb 10 '14 at 18:56
  • @JamesKanze: Indeed, I didn't think of that. – Mike Seymour Feb 10 '14 at 18:58
  • I might also mention that with optimization, I would expect any decent compiler to optimize the loops away, since the resulting `sum` is never used. Of course, this optimization probably applies to `std::accumulate` as well, since the source code of the function is available. – James Kanze Feb 10 '14 at 19:00
  • naive: 4127, sum = 1000; pointers: 4127, sum = 1000; algorithm: 27895, sum = 1000; ILP: 2128, sum = 1000 – Evgeny Lazin Feb 10 '14 at 19:04
  • @Lazin, you should unroll by three times for the best result. See my answer here. http://stackoverflow.com/questions/21090873/loop-unrolling-to-achieve-maximum-throughput-with-ivy-bridge-and-haswell. – Z boson Feb 11 '14 at 11:49
  • possible dupes (albeit opposite of each other): [Why is accumulate faster than a simple for cycle?](https://stackoverflow.com/questions/13243274/why-is-accumulate-faster-than-a-simple-for-cycle) / [Difference in performance: std::accumulate vs std::inner_product vs Loop](https://stackoverflow.com/questions/52167695/difference-in-performance-stdaccumulate-vs-stdinner-product-vs-loop) – underscore_d Sep 24 '18 at 16:03
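
For reference on the unrolling suggestion above: the dependency chain on a single `sum` limits the simple loop to one addition per floating-point add latency, so several independent accumulators let the additions overlap. A minimal sketch of the idea with three accumulators (the three-way split and the tail handling below are illustrative, not taken from the question):

// Three independent accumulators hide the FP-add latency; the count of three
// follows the latency/throughput argument in the comment above.
float s0 = 0.f, s1 = 0.f, s2 = 0.f;
int i = 0;
for (; i + 3 <= num_samples; i += 3) {
    s0 += samples[i];
    s1 += samples[i + 1];
    s2 += samples[i + 2];
}
float sum3 = s0 + s1 + s2;
for (; i < num_samples; i++)   // pick up any leftover elements
    sum3 += samples[i];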

2 Answers


For starters, your use of std::accumulate is summing integers. So you're probably paying the cost of converting each of the floating point values to an integer before adding it. Try:

sum = std::accumulate( samples, end, 0.f );

and see if that doesn't make a difference.

– James Kanze
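
To see the type deduction concretely: the accumulator type of `std::accumulate` is the type of its initial-value argument, so a plain `0` makes every element pass through `int`, with an int-to-float and a float-to-int conversion per element. A small standalone illustration (the sample data here is made up):

#include <numeric>
#include <iostream>

int main() {
    float data[4] = { 0.5f, 0.5f, 0.5f, 0.5f };

    // Accumulator is int: every partial sum is truncated back to int.
    int   as_int   = std::accumulate(data, data + 4, 0);    // 0
    // Accumulator is float: the sum stays in floating point.
    float as_float = std::accumulate(data, data + 4, 0.f);  // 2.0f

    std::cout << as_int << " vs " << as_float << '\n';
}

With the question's all-1.f data the truncation happens to be harmless (the printed sum is still 1000), but the per-element conversions remain.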
  • @stefan Ah yes, if he's using `float`. (I'd avoid using `float` for summing any reasonably large number of things. For that matter, with enough elements, even `std::accumulate` using a `double` will be inaccurate. As will his loops, of course.) – James Kanze Feb 10 '14 at 18:58
  • That's it. Once you replace `0` with `0.0F`, the results become identical, give or take the statistical error (I used a slow-ticking timer for this [demo on ideone](http://ideone.com/XXtnyR)). – Sergey Kalinichenko Feb 10 '14 at 18:59
  • And if you increase the number of iterations to 1 million, [`accumulate` is faster](http://coliru.stacked-crooked.com/a/d7d124936ff76055) – Praetorian Feb 10 '14 at 19:00
  • @Praetorian And neither are very accurate. You can't naively sum a million values using `float`, and expect to be anywhere near the correct results. – James Kanze Feb 10 '14 at 19:01
  • I've changed 0 to 0.0f and std::accumulate starts to work as fast as the simple loop. Great answer! – Evgeny Lazin Feb 10 '14 at 19:05
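
On the accuracy concern in the comments above: if the data grows to the point where a single `float` accumulator loses precision, compensated (Kahan) summation is one way to keep a single-pass loop accurate for a few extra operations per element. A minimal sketch (not from either answer):

// Kahan (compensated) summation: 'c' carries the rounding error lost by the
// previous addition and feeds it back into the next one.
float kahan_sum(const float* first, const float* last) {
    float sum = 0.f, c = 0.f;
    for (const float* p = first; p != last; ++p) {
        float y = *p - c;      // corrected next term
        float t = sum + y;     // low-order bits of y may be lost here
        c = (t - sum) - y;     // recover what was lost
        sum = t;
    }
    return sum;
}

Simply accumulating into a double (e.g. std::accumulate(samples, end, 0.0)) is an easier improvement for moderate sizes, though as James Kanze notes above, with enough elements even that loses accuracy.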

Since you (apparently) care about doing this fast, you might also consider trying to multi-thread the computation to take advantage of all available cores. I did a pretty trivial rewrite of your naive loop to use OpenMP, giving this:

timer.restart();
sum = 0;

// only real change is adding the following line:
#pragma omp parallel for schedule(dynamic, 4096), reduction(+:sum)
for (int i = 0; i < num_samples; i++) {
    sum += samples[i];
}
result = timer.elapsed();
std::cout << "OMP:\t\t" << result << ", sum = " << sum << std::endl;

Just for grins, I also rewrote your unrolled loop a little to allow semi-arbitrary unrolling, and added OpenMP as well:

static const int unroll = 32;
double total = 0.0;
timer.restart();
double sum[unroll] = { 0.0f };
#pragma omp parallel for reduction(+:total) schedule(dynamic, 4096)
for (int i = 0; i < num_samples; i += unroll) {
    for (int j = 0; j < unroll; j++)
        total += samples[i + j];
}
result = timer.elapsed();
std::cout << "ILP+OMP:\t" << result << ", sum = " << total << std::endl;

I also increased the array size (substantially) to get somewhat more meaningful numbers. The results were as follows. First for a dual-core AMD:

rewrite of 4096 Mb takes 8269023193
naive:          3336194526, sum = 536870912
pointers:       3348790101, sum = 536870912
algorithm:      3293786903, sum = 536870912
ILP:            2713824079, sum = 536870912
OMP:            1885895124, sum = 536870912
ILP+OMP:        1618134382, sum = 536870912

Then for a quad-core (Intel i7):

rewrite of 4096 Mb takes 2415836465
naive:          1382962075, sum = 536870912
pointers:       1675826109, sum = 536870912
algorithm:      1748990122, sum = 536870912
ILP:            751649497, sum = 536870912
OMP:            575595251, sum = 536870912
ILP+OMP:        450832023, sum = 536870912

From the looks of things, the OpenMP versions are probably hitting memory bandwidth limitations: the OpenMP versions make more use of the CPU than the un-threaded versions, but still only get to around 70% or so, indicating that something other than the CPU is acting as a bottleneck.

– Jerry Coffin
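
As a rough sanity check of the bandwidth point (assuming a core clock somewhere around 3 GHz, which is not stated above): the i7's OMP run touches 4096 MB in about 5.8e8 cycles, i.e. roughly 0.2 s, which works out to something on the order of 20 GB/s; that is close to the peak of a dual-channel DDR3 memory system, so a memory bottleneck is plausible.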
  • Nicely done! By the way, timing with the TSC register is not a good choice for this task. It will count context switches and other tasks scheduled on the same CPU, not only the thread you are trying to measure. It is more suitable for short pieces of code running on a single thread. – Evgeny Lazin Feb 11 '14 at 10:29
  • I don't understand your unroll code. You define double sum[unroll] but never use it. In your loop you only have one accumulator (`total`) and not several. The compiler won't do this for you. You have to define a number of accumulators equal to the latency and throughput of floating point addition. On SB/IB that's 3 accumulators and on Haswell it's 10. http://stackoverflow.com/questions/21090873/loop-unrolling-to-achieve-maximum-throughput-with-ivy-bridge-and-haswell – Z boson Feb 11 '14 at 12:57
  • @Zboson: Sorry 'bout the confusion. I'd written a version of the code that used `sum[unroll]`, but then rewrote without it. OpenMP can define separate accumulators for you--the `reduction(+:total)` means that it allocates a separate "total" for each thread, then adds them together at the end. – Jerry Coffin Feb 11 '14 at 15:15
  • @JerryCoffin, what OpenMP does in a reduction is create a separate accumulator for each thread not multiple accumulators per core. That's not unrolling. You still need to do three reductions with OpenMP on SB/IB to get the best result. – Z boson Feb 11 '14 at 15:20
  • @Zboson: As I said, I had used multiple accumulators at one point (that's how the definition ended up there). The code wasn't measurably faster with it, so simplifying the code seemed like the right thing to do. – Jerry Coffin Feb 11 '14 at 15:28
  • @JerryCoffin, for large arrays the code is memory bound, so you probably won't notice much benefit from unrolling for arrays that don't fit in the L1 or L2 cache. – Z boson Feb 11 '14 at 15:46
  • @Zboson: I suppose I could try with a small enough array to fit, and see if the multiple-accumulators made a difference for that case. – Jerry Coffin Feb 11 '14 at 16:00