
The following code shows a big performance difference between the two versions of max_3 on my machine (Windows 7, VC++ 2015, release build).

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

template <typename X>
const X& max_3_left( const X& a, const X& b, const X& c )
{
    return std::max( std::max( a, b ), c );
}

template <typename X>
const X& max_3_right( const X& a, const X& b, const X& c )
{
    return std::max( a, std::max( b, c ) );
}

int main()
{
    std::random_device r;
    std::default_random_engine e1( r() );
    std::uniform_int_distribution<int> uniform_dist( 1, 6 );
    std::vector<int> numbers;
    for ( int i = 0; i < 1000; ++i )
        numbers.push_back( uniform_dist( e1 ) );

    auto start1 = std::chrono::high_resolution_clock::now();
    int sum1 = 0;
    for ( int i = 0; i < 1000; ++i )
        for ( int j = 0; j < 1000; ++j )
            for ( int k = 0; k < 1000; ++k )
                sum1 += max_3_left( numbers[i], numbers[j], numbers[k] );
    auto finish1 = std::chrono::high_resolution_clock::now();
    std::cout << "left  " << sum1 << " " <<
        std::chrono::duration_cast<std::chrono::microseconds>(finish1 - start1).count()
        << " us" << std::endl;

    auto start2 = std::chrono::high_resolution_clock::now();
    int sum2 = 0;
    for ( int i = 0; i < 1000; ++i )
        for ( int j = 0; j < 1000; ++j )
            for ( int k = 0; k < 1000; ++k )
                sum2 += max_3_right( numbers[i], numbers[j], numbers[k] );
    auto finish2 = std::chrono::high_resolution_clock::now();
    std::cout << "right " << sum2 << " " <<
        std::chrono::duration_cast<std::chrono::microseconds>(finish2 - start2).count()
        << " us" << std::endl;
}

Output:

left  739861041 796056 us
right 739861041 1442495 us

On ideone the difference is smaller but still not negligible.

Why does this difference exist?

Tobias Hermann

3 Answers


gcc and clang (and presumably MSVC) fail to realize that max is an associative operation like addition. v[i] max (v[j] max v[k]) (max_3_right) is the same as (v[i] max v[j]) max v[k] (max_3_left). I'm writing max as an infix operator to point out the similarity with + and other associative operations.

Since v[k] is the only input that's changing inside the inner loop, it's obviously a big win to hoist the (v[i] max v[j]) out of the inner loop.
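
In source terms, that hoisting looks roughly like this (a hand-written sketch of the transformation, not actual compiler output; v stands for the numbers vector from the question):

// What gcc/clang can effectively do with the max_3_left loop:
// (v[i] max v[j]) doesn't depend on k, so it is computed once per j iteration.
int sum = 0;
for ( int i = 0; i < 1000; ++i )
    for ( int j = 0; j < 1000; ++j ) {
        int ij = std::max( v[i], v[j] );   // hoisted out of the inner loop
        for ( int k = 0; k < 1000; ++k )
            sum += std::max( ij, v[k] );   // one max (one cmov) per iteration
    }
// For max_3_right, max(v[j], v[k]) depends on k, so nothing hoists unless the
// compiler knows that max is associative and commutative.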


To see what's actually going on, we as always have to look at the asm. To make it easy to find the asm for the loops, I split them out into separate functions. (Making one template function that takes the max_3 function as a parameter would be more C++-like; a sketch of that follows the code below.) This has the added advantage of taking the code we want optimized out of main, which gcc marks as "cold", disabling some optimizations.

#include <algorithm>
#include <vector>
#define SIZE 1000
// max_3_right / max_3_left as defined in the question
int sum_maxright(const std::vector<int> &v) {
    int sum = 0;
    for ( int i = 0; i < SIZE; ++i )
        for ( int j = 0; j < SIZE; ++j )
            for ( int k = 0; k < SIZE; ++k )
                sum += max_3_right( v[i], v[j], v[k] );
    return sum;
}
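
For reference, the "one template function with the max_3 function as a parameter" idea mentioned above could look something like this (my own sketch; the name sum_max3 is made up, and it relies on SIZE and the max_3_* templates defined above):

#include <vector>

template <typename F>
int sum_max3( const std::vector<int> &v, F max3 ) {
    int sum = 0;
    for ( int i = 0; i < SIZE; ++i )
        for ( int j = 0; j < SIZE; ++j )
            for ( int k = 0; k < SIZE; ++k )
                sum += max3( v[i], v[j], v[k] );
    return sum;
}

// usage:
// int left  = sum_max3( v, max_3_left<int> );
// int right = sum_max3( v, max_3_right<int> );

With a plain function pointer the compiler may or may not propagate and inline the callee; wrapping the call in a lambda generally makes inlining trivial.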

The innermost loop of that compiles to the following (gcc 5.3 targeting the x86-64 Linux ABI with -std=gnu++11 -fverbose-asm -O3 -fno-tree-vectorize -fno-unroll-loops -march=haswell, with some hand annotations):

## from outer loops: rdx points to v[k] (starting at v.begin()).  r8 is v.end().  (r10 is v.begin)
## edi is v[i], esi is v[j]
## eax is sum

 ## inner loop.  See the full asm on godbolt.org, link below
.L10:
        cmp     DWORD PTR [rdx], esi      # MEM[base: _65, offset: 0], D.92793
        mov     ecx, esi                  # D.92793, D.92793
        cmovge  ecx, DWORD PTR [rdx]      # ecx = max(v[j], v[k])
        cmp     ecx, edi      # D.92793, D.92793
        cmovl   ecx, edi      # ecx = max(ecx, v[i])
        add     rdx, 4    # pointer increment
        add     eax, ecx  # sum, D.92793
        cmp     rdx, r8   # ivtmp.253, D.92795
        jne     .L10      #,

Clang 3.8 makes similar code for the max_3_right loop, with two cmov instructions inside the inner loop. (Use the compiler dropdown in the Godbolt Compiler Explorer to see.)


gcc and clang both optimize the way you'd expect for the max_3_left loop, hoisting everything but a single cmov out of the inner loop.

## register allocation is slightly different here:
## ecx = max(v[i], v[j]) (hoisted out of the loop).    rdi = v.end()
.L2:
        cmp     DWORD PTR [rdx], ecx      # MEM[base: _65, offset: 0], D.92761
        mov     esi, ecx  # D.92761, D.92761
        cmovge  esi, DWORD PTR [rdx]        # MEM[base: _65, offset: 0],, D.92761
        add     rdx, 4    # ivtmp.226,
        add     eax, esi  # sum, D.92761
        cmp     rdx, rdi  # ivtmp.226, D.92762
        jne     .L2       #,

So there's much less going on in this loop. (On Intel pre-Broadwell, cmov is a 2-uop instruction, so one fewer cmov is a big deal.)


BTW, cache prefetching effects can't possibly explain this:

  • The inner loop accesses numbers[k] sequentially. The repeated accesses to numbers[i] and numbers[j] are hoisted out of the inner loop by any decent compiler, and wouldn't confuse modern prefetchers even if they weren't.

    Intel's optimization manual says that up to 32 streams of prefetch patterns can be detected and maintained (with a limit of one forward and one backward per 4k page), for Sandybridge-family microarchitectures (section 2.3.5.4 Data Prefetching).

    The OP didn't say which CPU he ran this microbenchmark on, but since real compilers hoist the other loads, leaving only the most trivial access pattern, it hardly matters.

  • one vector of 1000 ints (4 B each) only takes about 4 kB. This means the whole array easily fits in L1D cache, so there's no need for any kind of prefetching in the first place; it all stays hot in L1 cache pretty much the entire time.

Peter Cordes

As molbdnilo pointed out, the problem could be with the order of the loops. When calculating sum1, the code can be rewritten as:

for ( int i = 0; i < 1000; ++i )
   for ( int j = 0; j < 1000; ++j ) {
      auto temp = std::max(numbers[i], numbers[j]);
      for ( int k = 0; k < 1000; ++k )
            sum1 += std::max(temp, numbers[k]);
   }

The same cannot be applied to the calculation of sum2. However, when I reordered the second set of loops as:

for ( int j = 0; j < 1000; ++j )
   for ( int k = 0; k < 1000; ++k )
      for ( int i = 0; i < 1000; ++i )
         sum2 += ...;

I got the same times for both calculations. (Moreover, both calculations are much faster with -O3 than with -O2; the former seemingly turns on vectorization, according to the disassembled output.)
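
The reason the (j, k, i) order makes the right variant fast is the same hoisting as before: max(numbers[j], numbers[k]) no longer depends on the innermost loop variable. Roughly, what the compiler can then do (a hand-written sketch, with max_3_right expanded into std::max calls):

for ( int j = 0; j < 1000; ++j )
   for ( int k = 0; k < 1000; ++k ) {
      auto temp = std::max(numbers[j], numbers[k]);  // invariant for the i loop
      for ( int i = 0; i < 1000; ++i )
         sum2 += std::max(numbers[i], temp);
   }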

Daniel Langr

This is related to data cache prefetching at the hardware level.

If you use the left-associative version, the elements of the array are used/loaded in the sequence expected by the CPU's cache prefetcher, with less latency.

The right-associative version breaks that prediction and will generate more cache misses, hence the slower performance.

Christophe
  • @VladFeinstein That will teach me to look at the number of digits in the output. Ingesting more coffee. – NathanOliver Apr 01 '16 at 14:32
  • Would this be the reason that changing the indexing order from i,j,k to k,j,i also "flips" the timings? (It does for me, at least.) – molbdnilo Apr 01 '16 at 14:36
  • Changing the loop order from (i,j,k) to (j,k,i) for the second loop gives me practically the same times for both calculations (g++ 4.9.3). With `-O2`, the left variant is still a bit faster, but with `-O3` (which uses vectorization according to the disassembly), the right variant is a bit faster. – Daniel Langr Apr 01 '16 at 14:43
  • This is probably an entirely new question, but I strongly feel the compiler should be able to realize the cache consequences and optimize this, unless max itself is an intrinsic. – Captain Giraffe Apr 01 '16 at 20:10
  • @CaptainGiraffe: this answer is not correct, but the correct answer still depends on compilers being dumb and not fully optimizing after inlining the STL `max` functions. – Peter Cordes Apr 04 '16 at 20:30
  • Answering such a question without looking at the assembly code is highly speculative. –  Apr 04 '16 at 20:40