
Disclaimer: I'm using the Intel Compiler 2017. If you want to know why I'm doing this, see the end of the question.

I have this code:

class A{
  std::vector<float> v;
  ...
  void foo();
  void bar();
};

void A::foo(){
  for(int i=0; i<bigNumber;i++){
    //something very expensive
    //call bar() many times per cycle;
  }
}

void A::bar(){
  //...
  v.push_back(/*something*/);
}

Now, let's suppose I want to parallelize foo() since it's very expensive. However, I can't simply use #pragma omp parallel for, because concurrent calls to v.push_back() on the shared vector are a data race.

To my knowledge, there are two alternatives here:

  1. We use #pragma omp critical
  2. We create a local version of v for each thread and then join them at the end of the parallel section, more or less as explained here.

Solution 1. is often considered a bad solution because serializing every push_back() at the critical section creates considerable overhead.
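For reference, a minimal sketch of what solution 1 looks like (bigNumber is taken to be a parameter here, and the pushed value is a placeholder):

```cpp
#include <vector>

class A {
public:
    std::vector<float> v;

    void bar() {
        // Every thread contends for the same lock on every call,
        // which serializes exactly the part we wanted to speed up.
        #pragma omp critical
        v.push_back(1.0f); // placeholder value
    }

    void foo(int bigNumber) {
        #pragma omp parallel for
        for (int i = 0; i < bigNumber; ++i) {
            // ... something very expensive ...
            bar(); // called many times per cycle
        }
    }
};
```

If bar() does little besides the push_back, the critical section dominates and the parallel version can easily run slower than the serial one.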

However, solution 2. requires modifying bar() in this way:

class A{
  std::vector<float> v;
  ...
  void foo();
  void bar(std::vector<float> &local_v);
};

void A::foo(){
  #pragma omp parallel
  {
    std::vector<float> local_v;
    #pragma omp for
    for(int i=0; i<bigNumber;i++){
      //something very expensive
      //call bar(local_v) many times per cycle;
    }
    #pragma omp critical
    {
      v.insert(v.end(), local_v.begin(), local_v.end());
    }
  }
}

void A::bar(std::vector<float> &local_v){
  //...
  local_v.push_back(/*something*/);
}

So far so good. Now, let's suppose that there is not only v but 10 vectors, say v1, v2, ..., v10, or anyway 10 shared variables. And in addition, let's suppose that bar() isn't called directly inside foo() but only after many nested calls: foo() calls foo1(std::vector<float> &v1, ..., std::vector<float> &v10), which calls foo2(std::vector<float> &v1, ..., std::vector<float> &v10), and so on through many levels, until the last one finally calls bar(std::vector<float> &v1, ..., std::vector<float> &v10).

So, this looks like a nightmare for maintainability (I have to modify the headers and call sites of all the nested functions)... But even more important: we agree that passing by reference is efficient, but it still copies a pointer. As you can see, a lot of pointers are copied many times here. Is it possible that all these copies result in inefficiency?
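One way to tame the signatures (a sketch, not something from the original code; the struct and function names are hypothetical) is to bundle all the per-thread buffers into a single context struct, so every nested call passes one reference instead of ten:

```cpp
#include <vector>

// Hypothetical: one struct holding all per-thread output buffers.
struct ThreadLocalBuffers {
    std::vector<float> v1;
    std::vector<float> v2;
    // ..., up to v10 in the real code
};

// The deepest function appends to whichever buffers it needs.
void bar(ThreadLocalBuffers &buf) {
    buf.v1.push_back(1.0f); // placeholder values
    buf.v2.push_back(2.0f);
}

// Intermediate levels just forward the single reference.
void foo1(ThreadLocalBuffers &buf) {
    // ... deeper nesting elided ...
    bar(buf);
}
```

Each call then copies exactly one pointer regardless of how many buffers there are, and adding an eleventh vector touches only the struct definition, not every signature in the call chain.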

Actually, what I care about most here is performance, so if you tell me "nah, it's fine because compilers are super intelligent and do some sorcery, so you can copy a trillion references with no drop in performance", then it will be fine, but I don't know whether such sorcery exists.

Why I'm doing this: I'm trying to parallelize this code. In particular, I'm rewriting the while here as a for which can be parallelized, but if you follow the code you'll find out that the callback onAffineShapeFound from here is called, which modifies the state of the shared object keys. This happens for many other variables, but this is the "deepest" case in this code.

justHelloWorld
  • Not sure if you can do it with openmp but you could also initialize the vector to the full size and then give each thread a segment of the vector to work on. Yes they would all be working on the same vector but since none of them touch the same element and none of the threads modify the state of the vector it should be thread safe. – NathanOliver Apr 12 '17 at 15:49
  • @NathanOliver Thanks for your answer. This is not possible since there is no way to know `key` 's size in advance. – justHelloWorld Apr 12 '17 at 15:50
  • "*As you can see, here a lot of pointers are copied many times. Is it possible that all these copies result as inefficiency?*"—Compared to what, exactly? The only time `void foo(T & t)` is going to be less efficient than `void foo(T t)` is if `T` is a small type like `char` or maybe even `int`. In any other situation, the performance of using references should almost always be better. – Xirema Apr 12 '17 at 15:51
  • @Xirema In the original version of `bar` there is no argument, so there is no copy at all. In the new version of `bar` there is `std::vector &local_v` as argument, or even 10 arguments as I supposed later in my question. I'm sorry if this was not clear. – justHelloWorld Apr 12 '17 at 15:53
  • You are using the wrong container for the wrong job then. Look at `libtbb`, more so at `concurrent_vector`. – Gizmo Apr 12 '17 at 16:00
  • @justHelloWorld If `foo` is as expensive as you claim, then I imagine that the overhead of passing some additional references will be comparably minimal. However, you should definitely give serious consideration to your maintainability point! – Thomas Russell Apr 12 '17 at 16:02
  • @Gizmo that push_back is called thousands of times. I have no idea how that container is implemented, but the race condition here would kill the performance, don't you think? – justHelloWorld Apr 12 '17 at 16:02
  • 1
    @justHelloWorld only profiling will show, don't underestimate Intel. Also if you don't want an ugly unmaintainable mess, use the correct tools to archieve your goal. If multithreading makes it slower and maintainable, or faster and unmaintainable, then don't change anything. Simple as that? I would hope. But `libtbb` is from Intel, you are using an Intel compiler, probably for code which is going to run on the Intel platform.. are you seriously going to roll your own solution? – Gizmo Apr 12 '17 at 16:03
  • Also, if you have a race condition, you have to worry about a lot more than performance. When multithreading and accessing the same (not MT-friendly) resources, you need synchronization anyway. – Gizmo Apr 12 '17 at 16:35

1 Answer


In a direct comparison between A::bar() and A::bar(std::vector<float> &v), the difference is that the second version has to grow the stack by an additional 8 bytes (one pointer) per call. In terms of performance, this is a pretty minimal effect: the stack pointer has to be adjusted whether or not the function takes arguments, so the only real difference is a single pointer copy, which might even be optimized away depending on the compiler (passed in a register, or elided entirely if the call is inlined). Meanwhile, the actual work of the function dominates: repeatedly appending elements to a std::vector is a far more expensive operation, especially when the vector has to be reallocated (which will probably happen frequently, depending on how big the vector needs to get). Those costs will far exceed the cost of the pointer copies.

So, short version: Go nuts with the references.
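Since reallocation is the dominant cost here, it is worth noting that if you can guess even a rough upper bound on how many elements a thread will produce, reserving the per-thread buffer up front removes most reallocations. A minimal sketch (the estimate and helper name are hypothetical, not from the question):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: pre-size a per-thread buffer from a rough
// estimate so repeated push_back calls rarely trigger a reallocation.
std::vector<float> make_local_buffer(std::size_t estimate) {
    std::vector<float> local_v;
    local_v.reserve(estimate); // capacity grows now, size stays 0
    return local_v;
}
```

Over-reserving only wastes some memory until the buffer is merged and destroyed; under-reserving just falls back to the usual amortized growth, so a crude estimate is safe either way.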

Xirema
  • The point is that, as I explained in my question, this trick is done not only in one case but with several objects, so it's not only `v` but rather `v1, v2, ..., vn`. As I said in my first comment, we don't know the size of `v`, but I could `reserve` it with a reasonable amount (thanks, you gave me the idea). I think the only real solution is implementing both approaches, as someone suggested, and profiling them. – justHelloWorld Apr 12 '17 at 18:13
  • What if I encapsulate all these arguments into one structure and pass only that by reference? There would be only one copy per call (the pointer to the structure object), right? – justHelloWorld Apr 12 '17 at 21:07
  • @justHelloWorld Yes, but this almost certainly counts as premature optimization. Find out how expensive the original call stack is, how expensive your new call stack is, and find out if you're actually saving a significant number of CPU cycles with these abstractions. I'd put somewhere around 95% odds that these optimizations don't actually matter much. – Xirema Apr 12 '17 at 21:10