Disclaimer: I'm using Intel Compiler 2017. If you want to know why I'm doing this, see the end of the question.
I have this code:
class A{
    std::vector<float> v;
    ...
    void foo();
    void bar();
};
void A::foo(){
    for(int i = 0; i < bigNumber; i++){
        //something very expensive
        //calls bar() many times per iteration
    }
}
void A::bar(){
    //...
    v.push_back(/*something*/);
}
Now, let's suppose I want to parallelize foo(), since it's very expensive. However, I can't simply use #pragma omp parallel for because of the v.push_back() calls. To my knowledge, there are two alternatives here:
1. We use #pragma omp critical around the shared write (a minimal sketch is shown right after this list).
2. We create a local version of v for each thread and then join them at the end of the parallel section, more or less as explained here.
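A minimal sketch of alternative 1, with the placeholder comments carried over from the snippet above:

void A::foo(){
    #pragma omp parallel for
    for(int i = 0; i < bigNumber; i++){
        //something very expensive
        //calls bar() many times per iteration
    }
}
void A::bar(){
    //...
    #pragma omp critical
    {
        //every push_back is serialized across threads
        v.push_back(/*something*/);
    }
}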
Solution 1 is often considered a bad solution, because the critical section serializes the writes and the resulting contention introduces considerable overhead.
However, solution 2 requires modifying bar() as follows:
class A{
    std::vector<float> v;
    ...
    void foo();
    void bar(std::vector<float> &local_v);
};
void A::foo(){
    #pragma omp parallel
    {
        std::vector<float> local_v;
        #pragma omp for
        for(int i = 0; i < bigNumber; i++){
            //something very expensive
            //calls bar(local_v) many times per iteration
        }
        //merge the per-thread results; only this merge is serialized
        #pragma omp critical
        {
            v.insert(v.end(), local_v.begin(), local_v.end());
        }
    }
}
void A::bar(std::vector<float> &local_v){
    //...
    local_v.push_back(/*something*/);
}
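As an aside, if your compiler's OpenMP 4.0 support is solid (Intel 2017 should qualify), a user-declared reduction can express the same merge pattern without the explicit critical section. This is only a sketch of that variant; vec_merge is a reduction name I made up:

//user-declared reduction that concatenates per-thread vectors (OpenMP 4.0)
#pragma omp declare reduction(vec_merge : std::vector<float> : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

void A::foo(){
    std::vector<float> tmp;  //reductions on class members are not allowed, so use a local
    #pragma omp parallel for reduction(vec_merge : tmp)
    for(int i = 0; i < bigNumber; i++){
        //something very expensive
        //calls bar(tmp) many times per iteration; each thread sees its own private copy
    }
    v.insert(v.end(), tmp.begin(), tmp.end());
}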
So far so good. Now, let's suppose that there is not only v, but ten vectors, say v1, v2, ..., v10, or in any case ten shared variables. And in addition, let's suppose that bar isn't called directly inside foo(), but only after many nested calls: foo() calls foo1(std::vector<float> &v1, ..., std::vector<float> &v10), which calls foo2(std::vector<float> &v1, ..., std::vector<float> &v10), and so on through many more levels of nesting, until finally the last function in the chain calls bar(std::vector<float> &v1, ..., std::vector<float> &v10).
So, this looks like a nightmare for maintainability (I would have to modify the headers and call sites of all the nested functions)... But even more important: we agree that passing by reference is efficient, but each pass is still a pointer copy. As you can see, a lot of pointers get copied many times here. Is it possible that all these copies add up to a real inefficiency?
Actually, what I care about most here is performance, so if you tell me "nah, it's fine, compilers are smart enough that you can copy a trillion references without any drop in performance", then I'm happy; I just don't know whether such sorcery exists.
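For what it's worth, the only maintainability workaround I can think of is bundling all the shared outputs into a single struct, so that only one reference travels down the call chain. This is purely a hypothetical restructuring (ThreadLocalState and the nested signatures are names I made up for the sketch):

//hypothetical bundle of all the per-thread outputs
struct ThreadLocalState {
    std::vector<float> v1, v2 /*, ..., v10 */;
};

void foo1(ThreadLocalState &s);   //every nested function takes a single reference
void bar(ThreadLocalState &s);    //and pushes into s.v1, ..., s.v10

void A::foo(){
    #pragma omp parallel
    {
        ThreadLocalState local;   //one private bundle per thread
        #pragma omp for
        for(int i = 0; i < bigNumber; i++){
            //something very expensive
            foo1(local);          //only one pointer-sized argument per call
        }
        #pragma omp critical
        {
            v1.insert(v1.end(), local.v1.begin(), local.v1.end());
            //...same merge for v2 through v10
        }
    }
}

This doesn't answer the reference-copy question, but it would at least reduce each nested call from ten pointer copies to one.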
Why I'm doing this:
I'm trying to parallelize this code. In particular, I'm rewriting the while here as a for that can be parallelized, but if you follow the code you'll find that the callback onAffineShapeFound from here gets called, which modifies the state of the shared object keys. This happens for many other variables as well, but this one is the "deepest" case in this code.