
This is a follow-up to this SO answer. Given a flat input range and three size_t dimensions, the code creates a nested random_access_range of random_access_ranges of random_access_ranges, modelling a three-dimensional array.

[Benchmark plot: Quickbench]

Iterating over the elements in the "multidimensional" range using a nested for-loop and indices is noticeably slower than directly iterating over the elements of the input range (about a factor of 4). I suppose some performance drop is to be expected, but a factor of 4 hurts a bit.

Even worse, recursively views::joining the multidimensional range back to a flat range and iterating over this flat range is slower by a factor of 20. Having read through the comments in this GitHub issue, some additional overhead from views::join is to be expected, but a factor of 20 seems a bit much.

How can this large overhead with views::join be explained? Am I using it wrong or is something fishy with my benchmark? Is there anything that can be done to speed up the code or are ranges just a poor choice for this kind of application?

Implementation

The code can be found under the Quickbench link above; I add it here for completeness:

#include <cstddef> // for size_t
#include <vector>
#include <ranges>
#include <cassert>
#include <iostream>

template <size_t dim>
struct Slice {
    // default constructor leaves begin at zero and end at dim. Corresponds to the whole dimension
    constexpr Slice() = default;

    // Create a slice with a single index
    constexpr Slice(size_t i) : begin(i), end(i+1) {
        assert( (0 <= i) && (i < dim));
    }

    // Create a slice with a start and an end index
    constexpr Slice(size_t s, size_t e) : begin(s), end(e+1) {
        assert( (0 <= s) && (s <= e) && (e < dim) );
    } 

    size_t begin {0};
    size_t end {dim};
};

// An adaptor object to interpret a flat range as a multidimensional array
template <size_t dim, size_t... dims>
struct MD {
    constexpr static auto dimensions = std::make_tuple(dim, dims...);
    consteval static size_t size(){
        if constexpr (sizeof...(dims) > 0) {
            return dim*(dims * ...);
        }
        else {
            return dim;
        }
    }

    // returns a multidimensional range over the elements in the flat array
    template <typename Rng>
    constexpr static auto slice(
        Rng&& range, 
        Slice<dim> const& slice, 
        Slice<dims> const&... slices
    )
    {        
        return slice_impl(range, 0, slice, slices...);
    }

    template <typename Rng>
    constexpr static auto slice_impl(
        Rng&& range, 
        size_t flat_index,  
        Slice<dim> const& slice, 
        Slice<dims> const&... slices
    )
    {
        if constexpr (std::ranges::sized_range<Rng>) { assert(std::size(range) >= size());  }
        static_assert(sizeof...(slices) == sizeof...(dims), "wrong number of slice arguments.");

        if constexpr (sizeof...(slices) == 0) 
        {
            // end recursion at inner most range
            return range | std::views::drop(flat_index*dim + slice.begin) | std::views::take(slice.end - slice.begin);
        }
        else 
        {
            // for every index to be kept in this dimension, recurse to the next dimension and increment the flat_index
            return std::views::iota(slice.begin, slice.end) | std::views::transform(
                [&range, flat_index, slices...](size_t i){
                    return MD<dims...>::slice_impl(range, flat_index*dim + i, slices...);
                }
            );
        }
    }

    // convenience overload for the full view
    template <typename Rng>
    constexpr static auto slice(Rng&& range){
        return slice(range, Slice<dim>{}, Slice<dims>{}...);
    }


};

// recursively join a range of ranges
// https://stackoverflow.com/questions/63249315/use-of-auto-before-deduction-of-auto-with-recursive-concept-based-fun
template <typename Rng>
auto flat(Rng&& rng) {
    using namespace std::ranges;

    auto joined = rng | views::join;    
    if constexpr (range<range_value_t<decltype(joined)>>) {
        return flat(joined);
    } else {
        return joined;
    }
}

Test case

Iterate over two 6x6x6 slices out of 10x10x10 arrays and add the elements of one slice to another.

// define the dimensions of a 3d-array
constexpr size_t nx{10}, ny{10}, nz{10};

// define the contents of two nx x ny x nz arrays in and out
std::vector<double> Vout(nx*ny*nz, 0.);
std::vector<double> Vin(nx*ny*nz, 1.23);

// define some slice indices for each dimension
size_t lx{0}, ly{0}, lz{0};
size_t hx{5}, hy{5}, hz{5};

auto r_in = MD<nx,ny,nz>::slice(Vin, {lx, hx}, {ly, hy}, {lz, hz});
auto r_out = MD<nx,ny,nz>::slice(Vout, {lx, hx}, {ly, hy}, {lz, hz});

Traditional for-loop

for (int k=lz; k<hz; ++k)
    for (int j=ly; j<hy; ++j)
        for (int i=lx; i<hx; ++i)
            Vout[i+nx*(j+ny*k)] += Vin[i+nx*(j+ny*k)];

MDRanges setup

This benchmark just measures the time for creating the two MD slice ranges and the flat ranges, without iterating over them.
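
For reference, a minimal sketch of what this setup-only benchmark body might look like, assuming Quick Bench's Google Benchmark interface and the dimensions and slice indices from the test case above; the exact code is behind the Quickbench link:

#include <benchmark/benchmark.h>

// Sketch: only the construction of the views is timed, no element is accessed.
static void MDRanges_setup(benchmark::State& state) {
    std::vector<double> Vout(nx*ny*nz, 0.);
    std::vector<double> Vin(nx*ny*nz, 1.23);
    for (auto _ : state) {
        auto r_in  = MD<nx,ny,nz>::slice(Vin,  {lx, hx}, {ly, hy}, {lz, hz});
        auto r_out = MD<nx,ny,nz>::slice(Vout, {lx, hx}, {ly, hy}, {lz, hz});
        auto r_in_flat  = flat(r_in);
        auto r_out_flat = flat(r_out);
        benchmark::DoNotOptimize(r_in_flat);
        benchmark::DoNotOptimize(r_out_flat);
    }
}
BENCHMARK(MDRanges_setup);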

Loop over flattened/joined ranges

// C++23: for (auto [o, i] : std::views::zip(flat(r_out), flat(r_in))) { o += i; }

auto r_in_flat = flat(r_in);
auto r_out_flat = flat(r_out);

auto o = r_out_flat.begin();
auto i = r_in_flat.begin();
for(; o != r_out_flat.end(); i++, o++){
    *o += *i;
}

Nested loop using ranges

for (size_t x = 0; x <= hx-lx; ++x)
    for (size_t y = 0; y <= hy-ly; ++y)
        for (size_t z = 0; z <= hz-lz; ++z)
            r_out[x][y][z] += r_in[x][y][z];

Edit 1:

I found an issue with the benchmark: The traditional loop missed some values because I used < in the condition of the for-loops where I should've used <=.

for (int k=lz; k<=hz; ++k)
    for (int j=ly; j<=hy; ++j)
        for (int i=lx; i<=hx; ++i)
            Vout[i+nx*(j+ny*k)] += Vin[i+nx*(j+ny*k)];

With this, the difference is a little less dramatic: the nested loop using ranges is 2x slower than the traditional loop, and the loop over the joined ranges is 12x slower. Still, I would have hoped for a lower penalty.

[Benchmark plot: Quickbench]

Edit 2:

Motivated by @Newbie's comment, I ran some benchmarks using a 1x1xN array. Interestingly, the first quick check showed really terrible results, where the non-joined ranges implementation was 450x slower than the native nested loop: https://quick-bench.com/q/-ZHPSTtvF4EZVg3JmuqMec4TYyU.

So I ran some tests using a 1xN array to benchmark the ranges patterns I used in the implementation:

drop_take: In the right-most dimension I simply std::views::drop the first few elements and std::views::take the number of elements I am looking for. This range is of the form take(drop(input_range)). This drop_take pattern works nicely and iterating over it is basically as fast as iterating over the input range.

iota_transform: In all dimensions but the right-most one, I std::views::transform each index from a std::views::iota into the range obtained for the next dimension to the right via recursion. So for the second-to-right dimension, we create a range of ranges of the form transform(iota, LAMBDA(take(drop(input_range)))). The benchmark shows that this causes a doubling of calculation time (presumably because of the lack of vectorization?).

join: Not much of a "pattern" but I included a benchmark for iterating over join(transform(iota, LAMBDA(take(drop(input_range))))). The performance drops again by a factor of 5.5.
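
To make the three patterns concrete, here is a minimal, self-contained sketch of their 1xN versions (sizes and names are illustrative, not the exact benchmark code):

#include <cstddef>
#include <ranges>
#include <vector>

// Illustrative 1xN versions of the three benchmarked patterns.
double sum_patterns() {
    constexpr size_t N = 1000, offset = 0, n = 600;
    std::vector<double> input(N, 1.23);

    // drop_take: take(drop(input_range)), about as fast as the raw vector
    auto drop_take = input | std::views::drop(offset) | std::views::take(n);

    // iota_transform: transform(iota, LAMBDA(take(drop(input_range)))),
    // a range of (one) range, benchmarked roughly 2x slower
    auto iota_transform = std::views::iota(size_t{0}, size_t{1})
        | std::views::transform([&](size_t) {
              return input | std::views::drop(offset) | std::views::take(n);
          });

    // join: flatten the range of ranges again, benchmarked another ~5.5x slower
    auto joined = iota_transform | std::views::join;

    double s = 0.;
    for (double v : drop_take) s += v;        // pattern 1
    for (auto inner : iota_transform)         // pattern 2
        for (double v : inner) s += v;
    for (double v : joined) s += v;           // pattern 3
    return s;
}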

[Benchmark plot: Quickbench]

So maybe the iota_transform pattern is an anti-pattern? Using std::views::iota to construct a range of ranges based on a list of indices seemed canonical to me, though the developers probably didn't have ranges as the output of std::views::transform in mind. The actual range I want to iterate over is in the lambda expression passed to the transform, so maybe this is a hard barrier for the compiler's optimizations?

But even then, it leaves the question unanswered why std::views::join should be so much slower. I can't fathom why it would require 5.5 times as much calculation time.

joergbrech
  • First question: Are you benchmarking a debug build or an optimized (e.g. `-O3`) one? – tadman Nov 09 '22 at 23:05
  • O3, see the Quickbench link – joergbrech Nov 09 '22 at 23:10
  • Can you include some code here for context? – tadman Nov 09 '22 at 23:22
  • The code can also be found in the Quickbench link, but you are right, I should have added the code as part of the question for completeness. I edited the question accordingly. – joergbrech Nov 10 '22 at 05:58
  • Did you look at the asm? The only asm that makes any sense is of `TraditionalForLoop`. `MDRanges_setup` has a lower time but it doesn't do anything, `MDRanges_loop_over_joined` is a huge mess where almost all the time is spent on weird things that aren't the actual computation, `MDRanges_nested_loop` is not vectorized and has a bunch of nonsense in the loop but a bit less of a total trainwreck. – harold Nov 10 '22 at 06:13
  • Yeah, I still have a hard time interpreting asm, but I also noticed that my ranges code generates a lot of it. – joergbrech Nov 10 '22 at 06:39
  • Very interesting question. I'm curious to know where you'll end up with it. You'll get more attention if you put a bounty on it. – Enlico Nov 12 '22 at 11:12
  • I would assume for loops get vectorized and SIMD optimized much more, whereas ranges are less predictable and there may be some branching to handle join logic. Furthermore, ranges are structures, so there is also some allocation going on. I would test incrementally bigger cuboids and also 1x1xN ones, to validate whether it is an intrinsic issue with the ranges or an issue with having many of them compared to the number of values. You have 43 ranges for your simple 6^3 example. – Newbie Nov 12 '22 at 19:30
  • Oh wow, on a 1x1x1000 benchmark the ranges code is 450 times slower than the plain for loop, even without join. https://quick-bench.com/q/-ZHPSTtvF4EZVg3JmuqMec4TYyU – joergbrech Nov 12 '22 at 20:11
  • Upvoted for the sole reason of getting you over 1k reputation. Seriously though, optimization is hard. You can try other approaches to looping over your data, but also try different libraries for data representations. Various libraries will define their own data types, and optimize them for their algorithms and storage. Yes, finding out why it's slow is interesting (if it's not the vector insert), but does it get you closer to what you need? This sounds like you're moving in the wrong direction, but then again, knowing why it is slower may help you solve it succinctly. Good luck! – Kieveli Dec 19 '22 at 19:33
  • @Kieveli, thanks for the 1k upvote :D Yes, this really is more of an academic question, it doesn't address a "real problem" I have in any of my projects right now. When I was taught C++ I was told to **always** use STL algorithms rather than my homebrew algorithms and only diverge from this rule for very good reasons. Now ranges are an amazing new feature in C++ that comes with a whole new set of algorithms and I got super excited about them. But I am now worrying that, in contradiction to what I learned, I should think hard before opting into the beautiful abstractions they provide. – joergbrech Dec 19 '22 at 19:57
  • The best advice is to never prematurely optimize. Use STL, and write as you would consider a good first solution. Then, if the problem is significant, optimize carefully and with intention. Most of the time, your first instinct is sufficiently fast. – Kieveli Dec 19 '22 at 20:23
  • @joergbrech If it makes you feel better, I *never* use any STL algorithm. I feel that both the algorithms are not optimized and that the idioms are confusing and introduce unnecessary complexity. The STL itself was born out of a CS/academic idea, not from a pragmatic, engineering necessity. – Something Something Jan 11 '23 at 11:46
  • The performance of iterating over a std::ranges::views::join view in C++20 can be slow due to the underlying implementation. The join view iterates over multiple ranges in parallel and combines their elements into a single view. This requires extra overhead to keep track of the iterators for each of the ranges and to synchronize their progress. Additionally, the view must perform comparisons between the elements of the different ranges to determine which one should be advanced at each step. – Marcus Feb 02 '23 at 00:06

1 Answer

I think you may find the answer to this in the talk "Multidimensional C++ - Bryce Adelstein Lelbach - CppNorth 2022", from around 30 to 45 minutes in.

@Peter: Fair enough, I'll give a full answer:

In summary, most optimizers are not able to vectorize the flattened/joined iterator. He shows several approaches: one where the iterator carries the current MD indices, one where it only carries the linear index (and must then reconstruct the MD indices, which is even worse), and one approach with coroutines, which works better on some compilers, so it requires a good coroutine implementation plus compiler support.
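
To illustrate the point (my own hedged sketch, not code from the talk): a flattened iterator over an nx x ny x nz view has to either carry all three indices and do carry logic on every increment, or carry one linear index and reconstruct the coordinates with divisions and modulos, neither of which resembles the unit-stride loop an auto-vectorizer wants to see.

#include <cstddef>

// (a) the iterator carries the multidimensional indices and does carry logic
struct md_index_iterator {
    size_t x{0}, y{0}, z{0};
    void advance(size_t ny, size_t nz) {
        if (++z == nz) {                     // branch on every increment
            z = 0;
            if (++y == ny) { y = 0; ++x; }
        }
    }
};

// (b) the iterator carries a single linear index and reconstructs the
//     multidimensional indices on dereference
struct linear_index_iterator {
    size_t i{0};
    void coordinates(size_t ny, size_t nz, size_t& x, size_t& y, size_t& z) const {
        x = i / (ny * nz);                   // integer divisions per element
        y = (i / nz) % ny;
        z = i % nz;
    }
};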

The supposedly bad performance led to the conclusion that C++23 std::mdspan is not going to have flattened iterator support, which I think was unwise, but I guess it can be added in the future.

I happened to recently implement a multi-span type in pure C, and there I found that the flattened iterator is around half the speed of the nested one. This is what I would expect, and well worth it given the benefits of such a feature.

Here's a godbolt link that shows that flattened iteration performance can be good (in C at least).
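
Since the godbolt code is not included here, this is a minimal C++ sketch of what such a manually flattened single loop could look like, using the variables from the question's test case (my own illustration, not the linked code):

// Hypothetical flattened loop over the question's slice, equivalent to the
// three nested loops of the traditional version.
const size_t sx = hx - lx + 1, sy = hy - ly + 1, sz = hz - lz + 1;
for (size_t f = 0; f < sx * sy * sz; ++f) {
    const size_t k = lz + f / (sx * sy);   // slowest-varying index
    const size_t j = ly + (f / sx) % sy;
    const size_t i = lx + f % sx;          // fastest-varying index (unit stride)
    Vout[i + nx*(j + ny*k)] += Vin[i + nx*(j + ny*k)];
}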

tylo
  • This would be a link-only answer (since you didn't even summarize what sort of explanation would be there), but you didn't even include a link! Anyway, Stack Overflow answers have to at least be partly useful in case of link-rot, so at least summarize something. – Peter Cordes Feb 07 '23 at 20:27