I'm playing around with the Range-v3 library to perform a glorified find_if and was curious why google-benchmark consistently ranks my Range-v3 code worse than my std::find_if approach. Both g++ and clang produce the same pattern with -O3 and #define NDEBUG.
The specific example I have in mind is the following using the STL:
std::vector<int> lengths(large_number, random_number);
auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
auto accumulated_length = 0l;
auto found = std::find_if(lengths.begin(), lengths.end(), [&](auto const &val) {
accumulated_length += val;
return to_find < accumulated_length;
});
auto found_index = std::distance(lengths.begin(), found);
This is somewhat contrived for the purpose of illustration; ordinarily there would be a random generator behind the to_find variable and random values in the lengths vector.
Using the Range-v3 library, I get the following code:
using namespace ranges;
std::vector<int> lengths(large_number, random_number);
auto const to_find = accumulate(lengths, 0l) / 2;
auto found_index = distance(lengths | view::partial_sum()
| view::take_while([=](auto const i) {
return i <= to_find;
}));
My question is why the Range-v3 version is slower than the STL implementation. I understand this is still an experimental library, but is there something wrong with my code example, or am I misusing the range concepts?
Edit
An example google-benchmark driver (I am not sure whether it is correct):
#define NDEBUG
#include <numeric>
#include <vector>
#include <benchmark/benchmark.h>
#include <range/v3/all.hpp>
static void stl_search(benchmark::State &state) {
using namespace ranges;
std::vector<long> lengths(state.range(0), 1l);
auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
while (state.KeepRunning()) {
auto accumulated_length = 0l;
auto const found = std::find_if(lengths.begin(), lengths.end(), [&](auto const& val) {
accumulated_length += val;
return to_find < accumulated_length;
});
volatile long val = std::distance(lengths.begin(), found);
}
state.SetBytesProcessed(int64_t(state.iterations()) *
int64_t(state.range(0)) * sizeof(long));
}
static void ranges_search(benchmark::State &state) {
using namespace ranges;
std::vector<long> lengths(state.range(0), 1l);
auto const to_find = accumulate(lengths, 0l) / 2;
while (state.KeepRunning())
{
volatile long val = distance(lengths | view::partial_sum()
| view::take_while([=](auto const& i) {
return i <= to_find;
}));
}
state.SetBytesProcessed(int64_t(state.iterations()) *
int64_t(state.range(0)) * sizeof(long));
}
BENCHMARK(ranges_search)->Range(8 << 8, 8 << 16);
BENCHMARK(stl_search)->Range(8 << 8, 8 << 16);
BENCHMARK_MAIN();
Gives
ranges_search/2048 756 ns 756 ns 902091 20.1892GB/s
ranges_search/4096 1495 ns 1494 ns 466681 20.4285GB/s
ranges_search/32768 11872 ns 11863 ns 58902 20.5801GB/s
ranges_search/262144 94982 ns 94892 ns 7364 20.5825GB/s
ranges_search/524288 189870 ns 189691 ns 3688 20.5927GB/s
stl_search/2048 348 ns 348 ns 2000964 43.8336GB/s
stl_search/4096 690 ns 689 ns 1008295 44.2751GB/s
stl_search/32768 5497 ns 5492 ns 126097 44.452GB/s
stl_search/262144 44725 ns 44681 ns 15882 43.7122GB/s
stl_search/524288 91027 ns 90936 ns 7616 42.9563GB/s
with clang 4.0.1 and
ranges_search/2048 2309 ns 2307 ns 298507 6.61496GB/s
ranges_search/4096 4558 ns 4554 ns 154520 6.70161GB/s
ranges_search/32768 36482 ns 36454 ns 19191 6.69726GB/s
ranges_search/262144 287072 ns 286801 ns 2438 6.81004GB/s
ranges_search/524288 574230 ns 573665 ns 1209 6.80928GB/s
stl_search/2048 299 ns 298 ns 2340691 51.1437GB/s
stl_search/4096 592 ns 591 ns 1176783 51.6363GB/s
stl_search/32768 4692 ns 4689 ns 149460 52.0711GB/s
stl_search/262144 37718 ns 37679 ns 18611 51.8358GB/s
stl_search/524288 75247 ns 75173 ns 9244 51.9633GB/s
with gcc 6.3.1. My machine has a Haswell-generation processor. Both were compiled and executed with
g++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out
clang++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out