I have this struct:
struct Wrapper {
public:
const int r_range;
const int c_range;
...
}
Then I have this code:
vector<Wrapper> wrappers;
//fill wrappers somehow
const int wrappersSize = wrappers.size();
for(size_t i=0; i < wrappersSize; i++)
for(int r=0 ; r < wrappers[i].r_range; r++)
for(int c=0; c < wrappers[i].c_range; c++)
foo(wrappers[i], r, c);
Since performance are crucial in my application, I want to parallelize this.
The total number of foo
calls is over 2 millions, but wrapperSize
is only 32 and r_range
(similarly c_range
) differ a lot depending on i
(from 16 to 1024).
As conseuquence, by parallelizing only the outer loop produces a strong load imbalance (already tested). By collapsing these 3 for loops there would be a better load balance between threads.
However, as I explained here, I can't do simply collapse all of them, because the two inner for
loops depend on the first one, which is not allowed in openmp (correct me if I'm wrong). So I need to write the 3 for
so their count doesn't depend on the others.
The only solution that came to my mind is kinda horrible:
vector<size_t> wrappersIndexes;
vector<int> rIndexes;
vector<int> cIndexes;
for(size_t i=0; i < wrappersSize; i++)
for(int r=0 ; r < wrappers[i].r_range; r++)
for(int c=0; c < wrappers[i].c_range; c++){
wrappersIndexes.push_back(i);
rIndexes.push_back(r);
cIndexes.push_back(c);
}
const size_t totalFooCalls = wrappersIndexes.size(); //necessary for parallelfor
#pragma omp parallel for
for(size_t i=0; i<totalFoolCalls; i++)
foo(wrappers[wrappersIndexes[i]], rIndexes[i], cIndexes[i]);
Of course this can be done more efficiently by using reserve
and avoiding push_back
, but I wonder if there is any simpler/more efficient/more elegant/whatever solution.