I'm playing around with the Range-v3 library to perform a glorified find_if and was curious why google-benchmark consistently ranks my Range-v3 code worse than my std::find_if approach. Both g++ and clang produce the same pattern with -O3 and #define NDEBUG.
The specific example I have in mind is the following using the STL:
std::vector<int> lengths(large_number, random_number);
auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
auto accumulated_length = 0l;
auto found = std::find_if(lengths.begin(), lengths.end(), [&](auto const &val) {
accumulated_length += val;
return to_find < accumulated_length;
});
auto found_index = std::distance(lengths.begin(), found);
This is somewhat contrived for the purpose of illustration; ordinarily there would be a random generator behind the to_find variable and random values in the lengths vector.
Using the Range-v3 library, I get the following code:
using namespace ranges;
std::vector<int> lengths(large_number, random_number);
auto const to_find = accumulate(lengths, 0l) / 2;
auto found_index = distance(lengths | view::partial_sum()
| view::take_while([=](auto const i) {
return i <= to_find;
}));
My question is why the Range-v3 version is slower than the STL implementation. I understand this is still an experimental library, but is there something wrong with my code example, or am I misusing the range concepts?
Edit
An example google-benchmark driver (I am not sure whether it is correct):
#define NDEBUG
#include <numeric>
#include <vector>
#include <benchmark/benchmark.h>
#include <range/v3/all.hpp>
static void stl_search(benchmark::State &state) {
using namespace ranges;
std::vector<long> lengths(state.range(0), 1l);
auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
while (state.KeepRunning()) {
auto accumulated_length = 0l;
auto const found = std::find_if(lengths.begin(), lengths.end(), [&](auto const& val) {
accumulated_length += val;
return to_find < accumulated_length;
});
volatile long val = std::distance(lengths.begin(), found);
}
state.SetBytesProcessed(int64_t(state.iterations()) *
int64_t(state.range(0)) * sizeof(long));
}
static void ranges_search(benchmark::State &state) {
using namespace ranges;
std::vector<long> lengths(state.range(0), 1l);
auto const to_find = accumulate(lengths, 0l) / 2;
while (state.KeepRunning())
{
volatile long val = distance(lengths | view::partial_sum()
| view::take_while([=](auto const& i) {
return i <= to_find;
}));
}
state.SetBytesProcessed(int64_t(state.iterations()) *
int64_t(state.range(0)) * sizeof(long));
}
BENCHMARK(ranges_search)->Range(8 << 8, 8 << 16);
BENCHMARK(stl_search)->Range(8 << 8, 8 << 16);
BENCHMARK_MAIN();
Gives
ranges_search/2048 756 ns 756 ns 902091 20.1892GB/s
ranges_search/4096 1495 ns 1494 ns 466681 20.4285GB/s
ranges_search/32768 11872 ns 11863 ns 58902 20.5801GB/s
ranges_search/262144 94982 ns 94892 ns 7364 20.5825GB/s
ranges_search/524288 189870 ns 189691 ns 3688 20.5927GB/s
stl_search/2048 348 ns 348 ns 2000964 43.8336GB/s
stl_search/4096 690 ns 689 ns 1008295 44.2751GB/s
stl_search/32768 5497 ns 5492 ns 126097 44.452GB/s
stl_search/262144 44725 ns 44681 ns 15882 43.7122GB/s
stl_search/524288 91027 ns 90936 ns 7616 42.9563GB/s
with clang 4.0.1 and
ranges_search/2048 2309 ns 2307 ns 298507 6.61496GB/s
ranges_search/4096 4558 ns 4554 ns 154520 6.70161GB/s
ranges_search/32768 36482 ns 36454 ns 19191 6.69726GB/s
ranges_search/262144 287072 ns 286801 ns 2438 6.81004GB/s
ranges_search/524288 574230 ns 573665 ns 1209 6.80928GB/s
stl_search/2048 299 ns 298 ns 2340691 51.1437GB/s
stl_search/4096 592 ns 591 ns 1176783 51.6363GB/s
stl_search/32768 4692 ns 4689 ns 149460 52.0711GB/s
stl_search/262144 37718 ns 37679 ns 18611 51.8358GB/s
stl_search/524288 75247 ns 75173 ns 9244 51.9633GB/s
with gcc 6.3.1. My machine has a Haswell-generation processor. Both were compiled and executed with
g++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out
clang++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out