42

I need to find the max element in a vector, so I'm using std::max_element, but I've found that it's a very slow function, so I wrote my own version and managed to get 3x better performance. Here is the code:

#include <string>
#include <iostream>
#include <vector>
#include <algorithm>

#include <sys/time.h>

double getRealTime()
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (double) tv.tv_sec + 1.0e-6 * (double) tv.tv_usec;
}

inline int my_max_element(const std::vector<int> &vec, int size)
{
    auto it = vec.begin();
    int max = *it++;
    for (; it != vec.end(); it++)
    {
        if (*it > max)
        {
            max = *it;
        }
    }
    return max;
}

int main()
{
    const int size = 1 << 20;
    std::vector<int> vec;
    for (int i = 0; i < size; i++)
    {
        if (i == 59)
        {
            vec.push_back(1000000012);
        }
        else
        {
            vec.push_back(i);
        }
    }

    double startTime = getRealTime();
    int maxIter = *std::max_element(vec.begin(), vec.end());
    double stopTime = getRealTime();
    double totalIteratorTime = stopTime - startTime;

    startTime = getRealTime();
    int maxArray = my_max_element(vec, size);
    stopTime = getRealTime();
    double totalArrayTime = stopTime - startTime;

    std::cout << "MaxIter = " << maxIter << std::endl;
    std::cout << "MaxArray = " << maxArray << std::endl;
    std::cout << "Total CPU time iterator = " << totalIteratorTime << std::endl;
    std::cout << "Total CPU time array = " << totalArrayTime << std::endl;
    std::cout << "iter/array ratio: = " << totalIteratorTime / totalArrayTime << std::endl;
    return 0;
}

Output:

MaxIter = 1000000012
MaxArray = 1000000012
Total CPU time iterator = 0.000989199
Total CPU time array = 0.000293016
iter/array ratio: = 3.37592

on average std::max_element takes 3x more time than my_max_element. So why am I able to create a much faster std function so easily? Should I stop using std and write my own functions, since std is so slow?

Note: at first I thought it was because I'm using an integer i in a for loop instead of an iterator, but that seems not to matter now.

Compiling info:

g++ (GCC) 4.8.2

g++ -O3 -Wall -c -fmessage-length=0 -std=c++0x

Marc Glisse
MoonBun
  • what compiler and stl are you using? – Alexander Oh Sep 02 '14 at 11:18
  • Are you compiling with optimizations? – interjay Sep 02 '14 at 11:19
  • How did you compile your code? Did you enable optimizations (e.g. by compiling with `g++ -Wall -O2` if using [GCC](http://gcc.gnu.org/)...)? – Basile Starynkevitch Sep 02 '14 at 11:19
  • I added all the info, @Alex where do I get stl version? – MoonBun Sep 02 '14 at 11:21
  • Have you tried reversing the order of calls? Caching might be at work here. – Angew is no longer proud of SO Sep 02 '14 at 11:25
  • [this](http://coliru.stacked-crooked.com/a/8773bcf1e194c4e6) is what I got. std::max_element is so slow that even my_max_element is 70% faster – Bryan Chen Sep 02 '14 at 11:25
  • @Angew yes, it went down to 2.5.. good but not enough. – MoonBun Sep 02 '14 at 11:26
  • @Vladp not much difference on my pc, almost the same output! – billz Sep 02 '14 at 11:33
  • Most of the difference seems to come from vectorization (only one version is vectorized). File a PR in gcc's bugzilla? – Marc Glisse Sep 02 '14 at 11:33
  • And removing `-pthread` makes them closer – billz Sep 02 '14 at 11:37
  • Try adding `-DNDEBUG` to your command line. Maybe there are assertions inside the standard library. – Angew is no longer proud of SO Sep 02 '14 at 11:37
  • `my_max_element` breaks on empty vectors whereas `std::max_element` is required to detect and handle that case – M.M Sep 02 '14 at 11:38
  • @MattMcNabb, the cost of such a check is absolutely negligible in this case. – Griwes Sep 02 '14 at 11:40
  • [Here's](http://coliru.stacked-crooked.com/a/efe521953468dddb) a result with `NDEBUG` and `my_max_element` before `std::max_element`. `iter/array ratio: = 0.995192` which is a very marginal win for the standard library. – eerorika Sep 02 '14 at 11:49
  • With g++ 4.8 and `-O2` the difference disappears (and both time align to the time of `std::max_element` at `-O3`); it's the optimizer that is pulling some strange tricks. – Matteo Italia Sep 02 '14 at 11:49
  • [This might be a good explanation of why this is happening](http://randomascii.wordpress.com/2013/11/24/stdmin-causing-three-times-slowdown-on-vc/) – Mgetz Sep 02 '14 at 11:54
  • Does this happen if you use an array with random elements? – xen-0 Sep 02 '14 at 12:57
  • Your timing code implements _catastrophic cancellation_, which makes the whole question pointless. Please do _proper_ handling of time values (and do them on a scale where they are meaningful and not within the scheduler's variation). That is, **never** subtract two nearly identical float/double values, and never use double for time anyway. Also, run benchmarks for at least 2-3 seconds (better 10 or 20), not a few microseconds. A single interrupt or such can easily cause what you're seeing. – Damon Sep 02 '14 at 13:19
  • @Damon: Not really. Run the benchmark x times and the interrupt deviation will be negligible. Also, doubles have 52 bits of precision, which makes them well suited to store the 32 bits of the tv_usec measurement time. – xryl669 Sep 02 '14 at 16:15
  • @xryl669: Not at all. Floating point works with powers of two, and 10E-6 is not a power of two, so multiplying with it is not great to begin with, as it loses precision. Also, subtracting two almost identical numbers (differing only at the 5th digit) loses precision. The measurements are therefore totally bogus. There are a lot of articles on the internet about this. Every measured difference may be real, or random, or rounding / catastrophic cancellation, or scheduling jitter, or something different. You can't tell. No accurate measurement forfeits everything else before you even begin. – Damon Sep 02 '14 at 20:23
  • @Angew, libstdc++ doesn't use `assert` anywhere, `NDEBUG` won't affect it – Jonathan Wakely Sep 11 '14 at 15:01

4 Answers

29

Before voting on this answer, please test (and verify) this on your machine and comment/add the results. Note that I used a vector size of 1000*1000*1000 for my tests. Currently, this answer has 19 upvotes, but only one commenter has posted results, and those results did not show the effect described below (though they were obtained with a different test code; see comments).


There seems to be an optimizer bug/artifact. Compare the times of:

template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_orig(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
  if (__first == __last) return __first;
  _ForwardIterator __result = __first;

  while(++__first != __last)
    if (__comp(__result, __first))
      __result = __first;

  return __result;
}

template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_changed(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
  if (__first == __last) return __first;
  _ForwardIterator __result = __first;
  ++__first;

  for(; __first != __last; ++__first)
    if (__comp(__result, __first))
      __result = __first;

  return __result;
}

The first is the original libstdc++ implementation; the second should be a transformation with no change in behaviour or requirements. Clang++ produces very similar run times for the two functions, whereas g++ 4.8.2 is four times faster with the second version.


Following Maxim's proposal, after changing the vector element type from int to int64_t, the changed version is only 1.7 times faster than the original, not 4 times (g++ 4.8.2).


The difference is in predictive commoning of *result, that is, storing the value of the current max element so that it does not have to be reloaded from memory each time. This gives a far cleaner cache access pattern:

w/o commoning     with commoning
*                 *
**                 *
 **                 *
  **                 *
  * *                 *
  *  *                 *
  *   *                 *

Here's the asm for comparison (rdi/rsi contain the first/last iterators respectively):

With the while loop (2.88743 ms; gist):

    movq    %rdi, %rax
    jmp .L49
.L51:
    movl    (%rdi), %edx
    cmpl    %edx, (%rax)
    cmovl   %rdi, %rax
.L49:
    addq    $4, %rdi
    cmpq    %rsi, %rdi
    jne .L51

With the for loop (1235.55 μs):

    leaq    4(%rdi), %rdx
    movq    %rdi, %rax
    cmpq    %rsi, %rdx
    je  .L53
    movl    (%rdi), %ecx
.L54:
    movl    (%rdx), %r8d
    cmpl    %r8d, %ecx
    cmovl   %rdx, %rax
    cmovl   %r8d, %ecx
    addq    $4, %rdx
    cmpq    %rdx, %rsi
    jne .L54
.L53:

If I force commoning by explicitly storing *result into a variable prev at the start and whenever result is updated, and using prev instead of *result in the comparison, I get an even faster loop (377.601 μs):

    movl    (%rdi), %ecx
    movq    %rdi, %rax
.L57:
    addq    $4, %rdi
    cmpq    %rsi, %rdi
    je  .L60
.L59:
    movl    (%rdi), %edx
    cmpl    %edx, %ecx
    jge .L57
    movq    %rdi, %rax
    addq    $4, %rdi
    movl    %edx, %ecx
    cmpq    %rsi, %rdi
    jne .L59
.L60:

The reason this is faster than the for loop is that the conditional moves (cmovl) there are a pessimisation, as they are executed so rarely (Linus says that cmov is only a good idea if the branch is unpredictable). Note that for randomly distributed data the branch is expected to be taken Hn times, where Hn is the n-th harmonic number; this is a negligible proportion (Hn grows logarithmically, so Hn/n rapidly approaches 0). The conditional-move code will only be better on pathological data, e.g. [1, 0, 3, 2, 5, 4, ...].

Ruslan
dyp
  • If you can run this on a different machine / g++ version, please comment/add your results. – dyp Sep 02 '14 at 11:56
  • I used 1000*1000*1000 elements for my tests, but I don't quite trust the OP's measurement code. Each test has been conducted several times, and in normal + reversed order. – dyp Sep 02 '14 at 12:07
  • I get no such difference with g++ 4.8.2. I'll post my results in a moment. – R. Martinho Fernandes Sep 02 '14 at 12:42
  • @R.MartinhoFernandes Interesting. I haven't used nonius before; I downloaded the git repo, included nonius/main and used the nonius/nonius.h++ header. Mean is 54.12 vs 34.06 ns for a vector size of 1000*1000*1000; for both the "variance is severely inflated by outliers". – dyp Sep 02 '14 at 13:27
  • "variance is severely inflated by outliers" is usually a sign that the results might be skewed by external factors (but now I suspect I have a bug in that logic, since "found 0 outliers [...] variance is severely inflated by outliers" does not make sense ;) But that's off-topic. The point is that I can't reproduce a difference like the one you experienced. I think we need to look at the generated assembly. – R. Martinho Fernandes Sep 02 '14 at 13:31
  • @R.MartinhoFernandes *"I think we need to look at the generated assembly"* Agreed. I did some additional tests, since you slightly changed the function template, and got a different means. I also get a third set of different values when using a local vs namespace-scope lambda predicate. -- It's a community wiki on purpose, feel free to add some assembly (I haven't much experience with assembly, let alone optimized programs or nonius). If it's indeed an optimizer problem, it might disappear when embedding in nonius, though. – dyp Sep 02 '14 at 13:47
  • I'll look into it some more later and add whatever I find. Sorta busy now :) – R. Martinho Fernandes Sep 02 '14 at 13:48
  • @R.MartinhoFernandes Off-topic: "[Harward information](http://manwe.rmf.io:9876/max_element.html)" (looks like a typo) – dyp Sep 02 '14 at 13:50
  • Hint to localize the difference, already mentioned in the comments: `-fno-tree-vectorize`. Please don't forget to report it to gcc's bugzilla at some point. – Marc Glisse Sep 02 '14 at 15:42
  • gcc-4.9.1, Intel Core i5: mean 2.88743 ms for while, 1235.55 μs for for. I'm using function pointers to prevent inlining and volatile to prevent no-op elimination; this results in test bodies that have identical asm apart from the precise call to the max_element function. – ecatmur Sep 03 '14 at 09:33
  • @dyp if you're seeing ns I think the loop is being no-opped. Try using volatile to prevent this: https://gist.github.com/ecatmur/8c0b50dc4e30e1206a98#file-max_element-c – ecatmur Sep 03 '14 at 09:52
  • @R.MartinhoFernandes I tried your code and the calls got no-opped; I had to add volatile: https://gist.github.com/ecatmur/8c0b50dc4e30e1206a98#file-max_element-c – ecatmur Sep 03 '14 at 09:53
  • @ecatmur Aha! I tried your code for 1000*1000*100 elements, the results are 240 ms / 100 ms / 30 ms with g++4.8.2, Intel Xeon E3; with the original 1 << 20 elements, the results are 2.5 ms / 1 ms / 0.3 ms Thanks for your contributions. – dyp Sep 03 '14 at 10:41
  • Using GCC 9.2.0 shows no issue with either my_max_element_orig() or my_max_element_changed() (from the accepted answer) - both show the same results. However, I found a significant difference between pointers and iterators - using iterators is about 2 times worse than raw pointers. See my question https://stackoverflow.com/questions/58810888/is-there-a-way-to-optimize-std-algorithms – Rom098 Nov 12 '19 at 01:52
10

You are probably running your test in 64-bit mode, where sizeof(int) == 4, but sizeof(std::vector<>::iterator) == 8, so that assignment in the loop to int (what my_max_element does) is faster than to std::vector<>::iterator (this is what std::max_element does).

If you change std::vector<int> to std::vector<long>, the results change in favour of std::max_element:

MaxIter = 1000000012
MaxArray = 1000000012
Total CPU time iterator = 0.00429082
Total CPU time array = 0.00572205
iter/array ratio: = 0.749875

One important note: when benchmarking, disable CPU frequency scaling, so that the CPU does not switch gears in the middle of the benchmark.


But I think something else is at play here, since just changing the loop variable from int to long does not change the results...

Maxim Egorushkin
  • Are you suggesting that 64 bit aligned writes are slower than 32 bit aligned writes on a natively 64 bit architecture? (Heck, not to mention register writes.) – Griwes Sep 02 '14 at 11:58
  • You are right, even without changing for to while, long is much faster with std. – MoonBun Sep 02 '14 at 11:58
  • I rewrote `my_max_element` to use iterators internally as `max_element` do and it doesn't change much (http://coliru.stacked-crooked.com/a/367efade62603da0). – Matteo Italia Sep 02 '14 at 12:00
  • @Mgetz, AH. I forgot about that crap. Thanks. – Griwes Sep 02 '14 at 12:01
2

It's a simple issue of cache. To wit, the first time you load memory, in this case the contents of the vector, it's always considerably slower than if it's been recently accessed. I copied and pasted your code with GCC 4.9.

When the functions are reversed, the ratio is 1. When they're in the original order, the ratio is 1.6.

This still looks to me like a genuine misoptimization by GCC in the case of max_element. However, your function times are so low that they are dominated by CPU noise, such as the cache effects above, rather than yielding any meaningful comparison.

Reversed, Original

Puppy
  • Actually, with g++ 4.8 and `-O3`, running the functions in a loop (with no changes to the original array) I get the same results as the OP (`std::max_element` three times slower); the results hold even if I swap the order of execution of the two functions. – Matteo Italia Sep 02 '14 at 11:51
  • @Matteo: G++ 4.9 simply doesn't seem to quite reproduce the misoptimization here. – Puppy Sep 02 '14 at 11:55
  • Disable your CPU frequency scaling. – Maxim Egorushkin Sep 02 '14 at 11:56
0

I performed a test for max_element for my specific usecase and did several implementations using intrinsics which can be found here: https://github.com/XapaJIaMnu/maxelem_test

My implementations improve upon vanilla std::max_element by a factor of at least 2, provided your data is not sorted (or almost sorted) in ascending order.

These are test results on arrays of random size (up to 2000 elements) containing floats drawn from a uniform distribution over [-5:5]; the test is run 100000 times.

                               GCC 11.2   clang 14   icc 2022.1.0
std::max_element               2.6696s    0.4221s    0.4662s
sequential                     1.0831s    1.1924s    1.1472s
AVX512 max + max_reduce        0.2412s    0.2152s    0.2142s
AVX512 max_reduce only         0.2570s    0.2629s    0.2325s
AVX512 cmp_ps_mask             0.1884s    0.1826s    0.1833s
AVX512 ^ + vectorized overhang 0.2097s    0.2089s    0.2072s
AVX cmp_ps + movemask          0.2181s    0.1697s    0.1702s
SSE cmplt_psp + movemask       0.2692s    0.2051s    0.2221s

I'm too lazy to explore the assembly at the moment for a more detailed explanation.

XapaJIaMnu