How to rewrite 3 nested loops into a single one?

Question

I have this struct:

struct Wrapper {
  public:
  const int r_range;
  const int c_range;
  ...
}

Then I have this code:

vector<Wrapper> wrappers;
//fill wrappers somehow
const int wrappersSize = wrappers.size();
for(size_t i=0; i < wrappersSize; i++)
  for(int r=0 ; r < wrappers[i].r_range; r++)
    for(int c=0; c < wrappers[i].c_range; c++)
      foo(wrappers[i], r, c);

Since performance are crucial in my application, I want to parallelize this.

The total number of foo calls is over 2 millions, but wrapperSize is only 32 and r_range (similarly c_range) differ a lot depending on i (from 16 to 1024).

As conseuquence, by parallelizing only the outer loop produces a strong load imbalance (already tested). By collapsing these 3 for loops there would be a better load balance between threads.

However, as I explained here, I can't do simply collapse all of them, because the two inner for loops depend on the first one, which is not allowed in openmp (correct me if I'm wrong). So I need to write the 3 for so their count doesn't depend on the others.

The only solution that came to my mind is kinda horrible:

vector<size_t> wrappersIndexes;
vector<int> rIndexes;
vector<int> cIndexes;
for(size_t i=0; i < wrappersSize; i++)
  for(int r=0 ; r < wrappers[i].r_range; r++)
    for(int c=0; c < wrappers[i].c_range; c++){
      wrappersIndexes.push_back(i);
      rIndexes.push_back(r);
      cIndexes.push_back(c);
    }

const size_t totalFooCalls = wrappersIndexes.size(); //necessary for parallelfor

#pragma omp parallel for
for(size_t i=0; i<totalFoolCalls; i++)
  foo(wrappers[wrappersIndexes[i]], rIndexes[i], cIndexes[i]);

Of course this can be done more efficiently by using reserve and avoiding push_back, but I wonder if there is any simpler/more efficient/more elegant/whatever solution.

afaik it is anyhow better to parallelize the outer-most loop, so I dont see any need to change something on your loops — 463035818_is_not_an_ai, Apr 18 '17 at 11:26
@tobi303 you're right, but here we are not parallelizing inner loops, but we are collapsing them, which is particularly convenient in case of strong load imbalance, which is this case! — justHelloWorld, Apr 18 '17 at 11:28
@tobi303 I've added some details to my question, hope they can help you to understand my point — justHelloWorld, Apr 18 '17 at 11:36
Your code is pretty much what I'd do. It may not be particularly elegant, but I don't think its horrible either. Sometimes you have to suffer for performance... I'd use a `std::tuple` or `struct` for the three indexes. If you're super into it, you can hide everything in a random access iterator (OpenMP does support them). — Zulan, Apr 18 '17 at 11:36
ok understand, but why do you think your solution is "horrible"? I would probably wrap the three vectors into a single struct, but otherwise I think its a nice solution — 463035818_is_not_an_ai, Apr 18 '17 at 11:36
btw you might get more feedback at https://codereview.stackexchange.com/ — 463035818_is_not_an_ai, Apr 18 '17 at 11:41
@Zulan thanks for your comment, but what do you mean by "If you're super into it, you can hide everything in a random access iterator (OpenMP does support them)."? — justHelloWorld, Apr 18 '17 at 11:45
@tobi303 Usually manually collapsing `for` loops has a different approach, (like in [this](http://stackoverflow.com/a/28483812/4480180) answer) and I was wondering if there was a solution similar to that one (where we deduce the indexes instead of explicitly save them as in my solution). That would be more time and memory efficient. — justHelloWorld, Apr 18 '17 at 11:47
@tobi303 oh and btw should I repost it [there](https://codereview.stackexchange.com/) ? I don't know if it's possible to move it since this question already contains comments. — justHelloWorld, Apr 18 '17 at 11:49
dont repost. If you want to move it you should delete it here and post it there. However, what makes a good question is quite different on review as compared to here. E.g. there only working code will be accepted. So maybe you would need to add more code to make it selfcontained. — 463035818_is_not_an_ai, Apr 18 '17 at 11:53
@justHelloWorld Reposting on CodeReview is fine. Be aware that CR requires your *actual, unaltered* code. Preferably straight from your IDE. In its' current form, this runs the chance of being closed as Pseudo/Example code. — Kaz, Apr 18 '17 at 11:55
.. and dont worry about comments. Comments anyhow can be deleted at any time and are not meant to persist forever — 463035818_is_not_an_ai, Apr 18 '17 at 11:55
Actually custom iterators aren't necessary in this case, you can just use a vector. If you hade a way to efficiently compute `r` and `c` on the fly, a custom random access iterator might be the thing to use. But they are rather bulky to implement. — Zulan, Apr 18 '17 at 12:16

score 0 · Answer 1 · answered Apr 18 '17 at 12:14

0

The simplest way to hide the "ugliness" would be to make the a single std::vector of structs in a function like so:

#include <vector>

struct WrapperIndex {
    const Wrapper& wrapper;
    const int r;
    const int c;
};

template<typename T>
auto make_wrapper_index_container(const T& wrappers) {
    std::vector<WrapperIndex> indices;
    for (const auto& wrapper : wrappers) {
        for (int r = 0; r < wrapper.r_range; r++) {
            for (int c = 0; c < wrapper.c_range; c++) {
                indices.emplace_back(WrapperIndex{wrapper, r, c});
            }
        }
    }
    return indices;
}

...

auto index_container = make_wrapper_index_container(wrappers);
#pragma omp parallel for
for (auto it = index_container.begin(); it < index_container.end(); ++it) {
    foo(it->wrapper, it->r, it->c);
}

Dynamically determining r or r would require multiple lookups to different entries in wrappers, which would be really bad for performance (non-constant index determination). Hence using some extra memory here is the right thing to do.

answered Apr 18 '17 at 12:14

Zulan

21,896
6
49
109

what if I don't want to use `const` for `wrapper`? The code above is giving me errors for `emplace_back` (which I'm not familiar with and from my understanding should be more efficient in such a context) – justHelloWorld Apr 18 '17 at 13:51
Omit the `const` if you need it, but then you have to be very careful with race conditions. `emplace_back` doesn't really matter here since I construct a temporary `WrapperIndex` anyway due to the lack of a constructor. Check out [this question](http://stackoverflow.com/a/13813268/620382) for further improvement. This code is C++14 conformant. – Zulan Apr 18 '17 at 14:00
I've tested this solution. First, I've noticed an improvement by saving in `WrapperIndex` `const int i` instead of `Wrapper & wrapper`, especially because I save a couple of others things for each `i`, so it's cheaper and faster to save just `i`. However, even by using `reserve()`, filling `indices` takes 18 milliseconds (using Intel Compiler and opitmizionation flags), which is quite a lot for my application. That's a pity. – justHelloWorld Apr 18 '17 at 15:17
@justHelloWorld My code is meant for the use with the example `foo` function. If you use `i` directly, it does of course make sense to do that. You might be able to save some setup time by using some sort of a range tree. But that will cost you with logarithmic access time. If `c_range`s are large enough on average it could theoretically be beneficial to really compute indices on the fly. That does however assume that your OpenMP implementation doesn't make excessive use of the random access to the iterator. In particular you'd want as little chunks as possible for that case. – Zulan Apr 18 '17 at 16:04
Specific performance optimization would require an [mcve] with somewhat representative values for `wrappers` and ideally a good proxy for `foo`. – Zulan Apr 18 '17 at 16:05

How to rewrite 3 nested loops into a single one?

1 Answers1