C++ "No raw loops" without losing perfomance

Question

So the 'new (old) big thing' is "No Raw Loops" in C++. I'm trying to write code that way, but it seems very inefficient. Yes, there are STL algoritms which can do about anything, but they don't seem very efficient.

I for instance have a situation where I want a pointer to a node in an array of nodes that has the highest score. Determining that score is a costly floating-point operation. So I implemented the STL algorithm version and compared it with the raw loop:

#include <cfloat>
#include <iostream>
#include <array>
#include <algorithm>
#include <numeric>

static int counter;

class Node {
public:
    auto Score() const -> double {
        std::cout << "complex calculation\n";
        counter++;
        return 1;
    }
};

int main()
{
    
    std::array<Node, 10> nodes;
    
    counter = 0;
    Node const* nodePtr = std::max_element(std::cbegin(nodes), std::cend(nodes),
        [](Node const& node1, Node const& node2) {
            return node1.Score() < node2.Score();
        });
    std::cout << "algorithm count " << counter << std::endl;
    
    counter = 0;
    double maxScore = -FLT_MAX;
    for (const auto& node : nodes) {
        auto score = node.Score();
        if (score > maxScore) {
            maxScore = score;
            nodePtr = &node;
        }
    }
    std::cout << "raw loop count " << counter << std::endl;
}

Evaluating this, for the STL version, the costly Score function is evaluated 18 times, while the raw loop only uses 10 evaluations...

Am I doing it wrong, or are raw loops just not that bad?

edit: After the suggestion by user58697 that cout and the static counter would prevent compiler optimization, I changed the code:

#include <cfloat>
#include <cmath>
#include <iostream>
#include <array>
#include <algorithm>
#include <numeric>
#include <random>
#include <chrono>

template <typename T>
class Random {
private:
    std::default_random_engine generator;
    std::uniform_real_distribution<T> distribution;
public:
    Random()
        : generator()
        , distribution(0.0, 1.0)
    {}

    auto operator()() {
        return distribution(generator);
    };
};

static Random<double> myRandom;

class Timer {
private:
    std::chrono::high_resolution_clock::time_point startTime{};
public:
    void Start() noexcept {
        startTime = std::chrono::high_resolution_clock::now();
    }
    [[nodiscard]] auto ElapsedMs() const noexcept {
        return std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime).count();
    }
};

static Timer timer;

class Node {
private:
    double val;
public:
    Node() noexcept : val(myRandom()) {}

    [[nodiscard]] auto Score() const noexcept {
        auto score = std::sqrt(std::log(10.0 / val));
        score = std::sin(score) / std::cos(score);
        score = std::sqrt(std::sqrt(std::sqrt(std::sqrt(std::sqrt(score)))));
        score = std::pow(score, 1000);
        return score;
    }
};

int main()
{
    std::array<Node, 100000> nodes; // yeah, yeah... overloading the stack, I know

    for (auto i = 0; i < 2; i++) {
        timer.Start();
        Node const* nodePtr = &*std::max_element(std::cbegin(nodes), std::cend(nodes),
            [](Node const& node1, Node const& node2) {
                return node1.Score() < node2.Score();
            });
        std::cout << "algorithm elapsed time " << timer.ElapsedMs() << std::endl;

        timer.Start();
        double maxScore = -FLT_MAX;
        for (const auto& node : nodes) {
            auto score = node.Score();
            if (score > maxScore) {
                maxScore = score;
                nodePtr = &node;
            }
        }
        std::cout << "raw loop count " << timer.ElapsedMs() << std::endl;
    }
}

I run the loop twice to eliminate startup behavior... results of second loop (compiled with g++ 9.1 -O3):

algorithm elapsed time 16
raw loop count 8 (<== I see I forgot to change "count" to "time" :P)

So that's not it.

edit: Seeing this question got an upvote, people are still looking at it. Since this question was asked, C++20 was released. C++20's ranges library has a special feature which could have helped here, called Projection.

I.e. in this case you could use std::ranges::max_element or even std::ranges::max (which was missing from the old std algorithms) like

Node const* node = &*std::ranges::max_element(nodes, {}, &Node::Score);
...
Node const& node = std::ranges::max(nodes, {}, &Node::Score);

However, projection is not the solution here, due to the implementation choice, which doesn't use caching. The Proj projection function is called again and again for every argument of the Comp comparator function.

(The internal function call looks something like

return std::invoke(__comp, std::invoke(__proj, __a), std::invoke(__proj, __b)) ? __b : __a;

)

Regarding the use of the term STL : [What's the difference between “STL” and “C++ Standard Library”?](https://stackoverflow.com/questions/5205491/whats-the-difference-between-stl-and-c-standard-library). — François Andrieux, Oct 23 '19 at 20:11
See also: [Should one prefer STL algorithms over hand-rolled loops?](https://stackoverflow.com/q/135129/10077) — Fred Larson, Oct 23 '19 at 20:16
@FrançoisAndrieux it's something [Sean Parent said in 2013](https://channel9.msdn.com/Events/GoingNative/2013/Cpp-Seasoning), but in recent CppCon many presenters quoted him on this. — JHBonarius, Oct 23 '19 at 20:17
@FrançoisAndrieux It was advice Sean Parent gave back in at least 2013: https://www.youtube.com/watch?v=W2tWOdzgXHA — NathanOliver, Oct 23 '19 at 20:17
I'm just siding with r/cpp on this one where it doesn't matter any more today. People get the idea and your ability to have a technical conversation is not really hampered. P.S. They edited one of my answers over this technicality. — sweenish, Oct 23 '19 at 20:22
Maybe you are just using the wrong algorithm. std::max_element only loops through once but loops through the nodes so the comparison function has to call Score() on largest every time. std::accumulate calculates the max element much more like your second loop: `double max = std::accumulate(std::cbegin(nodes), std::cend(nodes), DBL_MIN, [](auto max, Node const& node1) { return std::max(node1.Score(),max); });` and there is probably a way to use std::bind to get rid of the lambda... — Jerry Jeremiah, Oct 23 '19 at 21:19
A side effect of `std::cout << "complex calculation\n";` prevents the compiler to optimize the code correctly. It really _must_ call `Score()` every time it is mentioned. Should it be pure, the compiler would cache its results. — user58697, Oct 24 '19 at 06:10
@FredLarson that's from 2008... with the evolution of C++ it's not up to date imho — JHBonarius, Oct 24 '19 at 11:02

Richard · Accepted Answer · 2019-10-23T21:16:13.823

Replacing raw loops with abstracted algorithms is good style because then you can reuse the algorithm many times but test it only once. Wrapping the loop this way may seem like syntactic sugar, but it greatly reduces the potential for bugs in your code because you can now do extensive unit tests on the abstracted algorithm and you never need to worry about mistakenly implementing it incorrectly when you need it.

However, you're comparing apples and oranges here. Your max_element implementation always calculates Score() for its comparison whereas your for loop caches the result of the Score() function.

A better implementation of Node might be:

class Node {
mutable:
    double cached_score = std::numeric_limits<double>::quiet_Nan();
public:
    auto Score() const -> double {
        if(std::isnan(cached_score)){
           std::cout << "complex calculation\n";
           counter++;
           cached_score = 1;
        }
        return cached_score;
    }
    void invalidate_cache() {
      cached_score = std::numeric_limits<double>::quiet_Nan();
    }
};

This way the complex calculation is only performed once.

Alternatively, write your own abstraction:

#include <cfloat>
#include <iostream>
#include <array>
#include <algorithm>
#include <numeric>

static int counter;

class Node {
public:
    auto Score() const -> double {
        std::cout << "complex calculation\n";
        counter++;
        return 1;
    }
};

template<class ForwardIt, class Evaluate, class Compare>
ForwardIt max_eval_element(
    ForwardIt first,
    ForwardIt last,
    Evaluate eval,
    Compare comp
){
    if (first == last) return last;

    ForwardIt largest = first;
    auto largest_val = eval(*first);
    ++first;
    for (; first != last; ++first) {
        const auto this_val = eval(*first);
        if (comp(largest_val, this_val)) {
            largest = first;
            largest_val = this_val;
        }
    }
    return largest;
}

int main()
{

    std::array<Node, 10> nodes;

    counter = 0;
    Node const* nodePtr = max_eval_element(std::cbegin(nodes), std::cend(nodes),
                                       [](Node const& node){ return node.Score(); },
                                       [](double const &a, double const &b) {
                                           return a<b;
                                       });
    std::cout << "algorithm count " << counter << std::endl;

    counter = 0;
    double maxScore = -FLT_MAX;
    for (const auto& node : nodes) {
        auto score = node.Score();
        if (score > maxScore) {
            maxScore = score;
            nodePtr = &node;
        }
    }
    std::cout << "raw loop count " << counter << std::endl;
}

In this instance, both loops perform the same number of evaluations.

Many internal codebases I've worked with have extensive libraries which extend the STL. It gives the teams I've worked on much greater confidence that their code has been written correctly and allows you to interpret complex operations at a glance. In this way, these abstractions also reduce the effort of understanding code and the effort of communication.

Shouldn't `Score()` look more like this instead? `auto Score() const -> double { if (std::isnan(cached_score)) { ... cached_score = 1; } return cached_score; }` — Remy Lebeau, Oct 23 '19 at 20:34
Thank you for both suggestions, but: in the first situation the size of Node grows. So there's again a (memory) performance impact. And in the second situation you're replacing a raw loop with a new function that effectively has the same raw loop, which just seems optical trickery. — JHBonarius, Oct 23 '19 at 20:39
@JHBonarius: If you're planning on using the `Score()` function repeatedly, then the space-time trade-off is one you'll have to consider anyway. — Richard, Oct 23 '19 at 20:58
@JHBonarius: See the first sentence of my answer. You're replacing a complex inline operation with a call to a generic function which you can write extensive unit tests for. If you use `max_eval_element` twice in a project, this halves the amount of testing you need to do versus the raw loops. It also provides you a function you can move between projects. If you look at the documentation for [`std::max_element`](https://en.cppreference.com/w/cpp/algorithm/max_element) you'll see that it too is just an abstraction. There's no magic here, only good design practices. — Richard, Oct 23 '19 at 21:00
OK, I'm going for this one. I modified it to `template > constexpr FwdIt max_eval_element(FwdIt first, FwdIt last, Eval eval, Pred pred = Pred())`, so I can call it with e.g. `return *max_eval_element(std::begin(nodes), std::end(nodes), std::mem_fn(&Node::Score));` for my code. still need to fix type detection though — JHBonarius, Oct 24 '19 at 09:08
I've just checked the ranges-v3 (C++20 base) implementation of this... The [max_element](https://github.com/ericniebler/range-v3/blob/master/include/range/v3/algorithm/max_element.hpp) offers an extra "projection" input, which could be used to apply a transformation. I.e. `auto nodeIt3 = rng::max_element(nodes, std::less(), [](Node const& node) { return node.Score(); });`, however, the implementation is kind of disappointing: `if(invoke(pred, invoke(proj, *first), invoke(proj, *tmp)))` i.e. the projection is applied twice. — JHBonarius, Oct 24 '19 at 14:42
I've checked this in [compiler explorer](https://godbolt.org/z/DKJ71d), and indeed the projection is evaluated twice... quite an inefficient implementation... — JHBonarius, Oct 24 '19 at 14:43

JHBonarius · Answer 2 · 2022-08-15T07:10:26.167

I just wanted to post the algo I used based on Richard's answer:

template <typename FwdIt, typename Proj, typename Comp = std::less<>>
constexpr FwdIt max_projected_element(FwdIt first, FwdIt last, Proj proj, Comp comp = {}) {
    auto found = first;
    if (first != last) {
        auto best = std::invoke(proj, *found);
        while (++first != last) {
            if (auto const value = std::invoke(proj, *first);
                std::invoke(comp, best, value)) {
                found = first;
                best = value;
            }
        }
    }
    return found;
}

which allows me to call the algo as

Node const* nodePtr = max_projected_element(
    std::cbegin(nodes), std::cend(nodes), &Node::Score);

score 1 · Answer 3 · answered Oct 23 '19 at 20:28

1

I think your case is somewhat pathological to std::max_element, since your comparator calls Score() on both elements every time, and you declare it to be expensive. So where you do:

for (const auto& node : nodes) {
    auto score = node.Score();
    if (score > maxScore) {
        maxScore = score;
        nodePtr = &node;
    }
}

... std::max_element effectively has to do:

for (const auto& node : nodes) {
    if (node.Score() > nodePtr->Score()) {
        nodePtr = &node;
    }
}

I think std::max_element would be just fine in a more normal case, where the value to compare by is cheap to access.

answered Oct 23 '19 at 20:28

Fred Larson

60,987
18
112
174

I disagree with you on "a more normal case". I rarely need to compare just rows of numbers. Usually I need to compare objects, based on the value of a member variable or an evaluation formula. And I think that is is a more common 'normal' case.. – JHBonarius Oct 23 '19 at 20:31
1

Even for containers of objects, most uses would reference a value that is trivial to access. – Fred Larson Oct 23 '19 at 20:33
1

@JHBonarius: Comparison based on "a member variable" access is *very different* from "an evaluation formula". In the former case, `max_element` is just fine; the later case, not so much. Optimizing the latter case requires knowing up-front that the value you're using to compare them is expensive to calculate and therefore should be cached. Which `min_element` cannot know. Essentially, your case is a performance issue because it's really two things: converting a list of some type into a list of values, then comparing based on those values. You optimize this through caching. – Nicol Bolas Oct 23 '19 at 21:11
@NicolBolas ... but that is not necessary with a raw loop. So I'd have to introduce a /new/ mechanism to overcome the limitations... that's a high cost just to fulfill this "no raw loops" thing... I think my conclusion is: the current infrastructure is too limited to remove all raw loops. I prefer Richard's solution in that case. – JHBonarius Oct 24 '19 at 09:24
@JHBonarius: "*but that is not necessary with a raw loop.*" And that's the tradeoff you *always* make for going high-level. The code looks more easily digestible and it's more difficult to make mistakes, but if your high-level needs do not exactly match the expectations of the tools at your disposal (in this case, that comparing values is cheap), then there may be additional overhead compared to a bespoke solution. – Nicol Bolas Oct 24 '19 at 13:24

score 0 · Answer 4 · answered Oct 24 '19 at 06:24

0

It seems that the problem is in how the test is written. The statements

    std::cout << "complex calculation\n";
    count++;

make a visible side effect, and the compiler is obliged to call Score every time it appears. If it was pure (that is, having no side effects), the compiler would optimize the redundant calls away by cacheing.

Try to use a real Score (I hope the real one is pure), and compare the execution times.

answered Oct 24 '19 at 06:24

user58697

7,808
1
14
28

1

I tried this. I removed counter and the cout. I added a scoring function that used a lot of `cmath`. Then I increased the node array to 100000 elements and used a `high_resolution_clock` to measure the elapsed time. I compiled with G++ 9.1 with -O3... Result: algorithm uses 22 ms, raw loop uses 8 ms... so I'm afraid this is not true. – JHBonarius Oct 24 '19 at 08:42

C++ "No raw loops" without losing perfomance

4 Answers4

Linked