This is a follow-up to this SO answer. Given a flat input range and three size_t dimensions, the code creates a nested random_access_range of random_access_ranges of random_access_ranges, modelling a three-dimensional array.
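To make the interface concrete, here is a minimal usage sketch (the dimensions and the assignment are purely illustrative; MD and Slice are defined in the implementation below):
// view a flat vector of 2*3*4 doubles as a 2x3x4 array and use nested indexing
std::vector<double> data(2*3*4, 0.);
auto md = MD<2, 3, 4>::slice(data);  // full view over all three dimensions
md[1][2][3] = 42.;                   // element at flat index 1*12 + 2*4 + 3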
Iterating over the elements in the "multidimensional" range using a nested for loop and indices is somewhat slower than directly iterating over the elements of the input range (roughly a factor of 4). I suppose some performance drop is to be expected, but a factor of 4 hurts a bit.
Even worse, recursively views::join-ing the multidimensional range back into a flat range and iterating over this flat range is slower by a factor of 20. Having read through the comments in this GitHub issue, it is to be expected that views::join comes with some additional overhead, but a factor of 20 seems a bit much.
How can this large overhead with views::join be explained? Am I using it wrong, or is something fishy with my benchmark? Is there anything that can be done to speed up the code, or are ranges just a poor choice for this kind of application?
Implementation
The code can be found under the quick-bench link above; I will add it here for completeness:
#include <cstddef>
#include <vector>
#include <ranges>
#include <cassert>
#include <iostream>
template <size_t dim>
struct Slice {
// default constructor leaves begin at zero and end at dim, i.e. it corresponds to the whole dimension
constexpr Slice() = default;
// Create a slice with a single index
constexpr Slice(size_t i) : begin(i), end(i+1) {
assert( (0 <= i) && (i < dim));
}
// Create a slice with a start and an end index
constexpr Slice(size_t s, size_t e) : begin(s), end(e+1) {
assert( (0 <= s) && (s <= e) && (e < dim) );
}
size_t begin {0};
size_t end {dim};
};
// An adaptor object to interpret a flat range as a multidimensional array
template <size_t dim, size_t... dims>
struct MD {
constexpr static auto dimensions = std::make_tuple(dim, dims...);
consteval static size_t size(){
if constexpr (sizeof...(dims) > 0) {
return dim*(dims * ...);
}
else {
return dim;
}
}
// returns a multidimensional range over the elements in the flat array
template <typename Rng>
constexpr static auto slice(
Rng&& range,
Slice<dim> const& slice,
Slice<dims> const&... slices
)
{
return slice_impl(range, 0, slice, slices...);
}
template <typename Rng>
constexpr static auto slice_impl(
Rng&& range,
size_t flat_index,
Slice<dim> const& slice,
Slice<dims> const&... slices
)
{
if constexpr (std::ranges::sized_range<Rng>) { assert(std::size(range) >= size()); }
static_assert(sizeof...(slices) == sizeof...(dims), "wrong number of slice arguments.");
if constexpr (sizeof...(slices) == 0)
{
// end recursion at inner most range
return range | std::views::drop(flat_index*dim + slice.begin) | std::views::take(slice.end - slice.begin);
}
else
{
// for every index to be kept in this dimension, recurse to the next dimension and increment the flat_index
return std::views::iota(slice.begin, slice.end) | std::views::transform(
[&range, flat_index, slices...](size_t i){
return MD<dims...>::slice_impl(range, flat_index*dim + i, slices...);
}
);
}
}
// convenience overload for the full view
template <typename Rng>
constexpr static auto slice(Rng&& range){
return slice(range, Slice<dim>{}, Slice<dims>{}...);
}
};
// recursively join a range of ranges
// https://stackoverflow.com/questions/63249315/use-of-auto-before-deduction-of-auto-with-recursive-concept-based-fun
template <typename Rng>
auto flat(Rng&& rng) {
using namespace std::ranges;
auto joined = rng | views::join;
if constexpr (range<range_value_t<decltype(joined)>>) {
return flat(joined);
} else {
return joined;
}
}
Test case
Iterate over two 6x6x6 slices out of 10x10x10 arrays and add the elements of one slice to another.
// define the dimensions of a 3d-array
constexpr size_t nx{10}, ny{10}, nz{10};
// define the contents of two nx x ny x nz arrays in and out
std::vector<double> Vout(nx*ny*nz, 0.);
std::vector<double> Vin(nx*ny*nz, 1.23);
// define some slice indices for each dimension
size_t lx{0}, ly{0}, lz{0};
size_t hx{5}, hy{5}, hz{5};
auto r_in = MD<nx,ny,nz>::slice(Vin, {lx, hx}, {ly, hy}, {lz, hz});
auto r_out = MD<nx,ny,nz>::slice(Vout, {lx, hx}, {ly, hy}, {lz, hz});
Traditional for-loop
for (int k=lz; k<hz; ++k)
for (int j=ly; j<hy; ++j)
for (int i=lx; i<hx; ++i)
Vout[i+nx*(j+ny*k)] += Vin[i+nx*(j+ny*k)];
MDRanges setup
This benchmark just tests the time of creating the two MD<nx,ny,nz>::slice ranges and the flattened ranges, without iterating over them.
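For reference, the body of this setup-only benchmark is essentially just the construction calls, along the lines of the sketch below (keeping the results alive with benchmark::DoNotOptimize is my assumption about the quick-bench harness):
// construct the slice views and the joined views, but never iterate over them
auto r_in = MD<nx,ny,nz>::slice(Vin, {lx, hx}, {ly, hy}, {lz, hz});
auto r_out = MD<nx,ny,nz>::slice(Vout, {lx, hx}, {ly, hy}, {lz, hz});
auto r_in_flat = flat(r_in);
auto r_out_flat = flat(r_out);
benchmark::DoNotOptimize(r_in_flat);
benchmark::DoNotOptimize(r_out_flat);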
Loop over flattened/joined ranges
// C++23: for (auto [o, i] : std::views::zip(flat(r_out), flat(r_in))) { o += i; }
auto r_in_flat = flat(r_in);
auto r_out_flat = flat(r_out);
auto o = r_out_flat.begin();
auto i = r_in_flat.begin();
for(; o != r_out_flat.end(); i++, o++){
*o += *i;
}
Nested loop using ranges
for (size_t x = 0; x <= hx-lx; ++x)
for (size_t y = 0; y <= hy-ly; ++y)
for (size_t z = 0; z <= hz-lz; ++z)
r_out[x][y][z] += r_in[x][y][z];
Edit 1:
I found an issue with the benchmark: the traditional loop missed some values because I used < in the condition of the for loops where I should have used <=.
for (int k=lz; k<=hz; ++k)
for (int j=ly; j<=hy; ++j)
for (int i=lx; i<=hx; ++i)
Vout[i+nx*(j+ny*k)] += Vin[i+nx*(j+ny*k)];
With this fix, the difference is a little less dramatic: the nested loop using ranges is 2x slower than the traditional loop, and the loop over the joined ranges is 12x slower. Still, I would have hoped for a lower penalty.
Edit 2:
Motivated by @Newbies' comment, I ran some benchmarks using a 1x1xN array. Interestingly, the first quick check showed really terrible results, where the non-joined ranges implementation was 450x slower than the native nested loop: https://quick-bench.com/q/-ZHPSTtvF4EZVg3JmuqMec4TYyU.
So I ran some tests using a 1xN array to benchmark the ranges patterns I used in the implementation (see the sketch after this list):
drop_take: In the right-most dimension I simply std::views::drop the first few elements and std::views::take the number of elements I am looking for. This range is of the form take(drop(input_range)). The drop_take pattern works nicely, and iterating over it is basically as fast as iterating over the input range.
iota_transform: In all dimensions but the right-most one, I std::views::transform each index of a std::views::iota into the range obtained from the right-neighbor dimension via recursion. So for the second-to-right dimension, we create a range of ranges of the form transform(iota, LAMBDA(take(drop(input_range)))). The benchmark shows that this causes a doubling of the calculation time (presumably because of the lack of vectorization?).
join: Not much of a "pattern", but I included a benchmark for iterating over join(transform(iota, LAMBDA(take(drop(input_range))))). The performance drops again by a factor of 5.5.
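To make these three patterns explicit, here is roughly what the measured loop bodies look like in isolation (a sketch only; N, the zero drop offsets and the sum accumulation are stand-ins for the actual 1xN quick-bench code):
constexpr size_t N {1000};
std::vector<double> in(N, 1.23);
double sum {0.};
// drop_take: the inner-most dimension, take(drop(input_range))
auto drop_take = in | std::views::drop(0) | std::views::take(N);
for (double v : drop_take) sum += v;
// iota_transform: a one-element outer dimension whose transform lambda
// returns the take(drop(...)) range of the inner dimension
auto iota_transform = std::views::iota(size_t{0}, size_t{1})
    | std::views::transform([&](size_t){
          return in | std::views::drop(0) | std::views::take(N);
      });
for (auto&& inner : iota_transform)
    for (double v : inner) sum += v;
// join: the same range of ranges, flattened with views::join
for (double v : iota_transform | std::views::join) sum += v;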
So maybe the iota_transform pattern is an anti-pattern? Using std::views::iota to construct a range of ranges based on a list of indices seemed canonical to me, though the developers probably didn't have ranges as the output of std::views::transform in mind. The actual range I want to iterate over is hidden inside the lambda expression passed to the transform, so maybe this is a hard barrier for the compiler's optimizations?
But even then it leaves the question unanswered why std::views::join should be so much slower. I can't fathom why it would require 5.5 times as much calculation time.