
I have a situation in which I need to do some heavy computation. I found out that subdividing my data and then merging the results back together is fastest (time grows faster than linearly with size, so splitting is logical).

The user should be able to give a data size to the application, say one million double values.

What I have in place now is sending created data of this size off to some function, getting it back after computation, and then looping over the returned data to unload it into the main vector.

I want to send chunks of 200, with one "last" chunk. For example, size = 1000005 would call the function 5000 times on full chunks, and then once more with the remaining 5 values.

int size = 1000000;
int times = size / 200;    // 5000
int leftover = size % 200; // 0, so the leftover part is not executed in this case

QVector<double> x(size);
QVector<double> y(size);
QVector<double> holder(200);

x = createData(size);
y = createData(size);

for (int i = 0; i < times; i++)
{
    QVector<double> tempx = x.mid(i * 200, 200);
    QVector<double> tempy = y.mid(i * 200, 200);
    holder = myfunction(tempx, tempy, 200);  // for now it just returns tempy
    for (int j = 0; j < 200; j++)
    {
        y[i * 200 + j] = holder[j];
    }
}
// leftover handled here, very similar to the loop above

// plotting function here

At the end, x remains as created, while y holds the computed results.

Since these code parts can run independently of each other and speed is crucial, I would like to make use of several cores.

The following further characterizes the situation:

  • These function calls are independent of each other; only at the end, when the vectors are complete, do I want to plot the result.
  • The completion time of individual calls will vary a lot.
  • The number of chunks should be variable.

I have read that the maximum number of threads is advised to be the number of cores (at least as a starting point), since using too many threads could slow the process down. Given the situation, a queueing system / thread pool seems to make sense, so that no time is lost while one thread finishes its easy jobs and the others are still grinding through harder ones. The core count itself can be queried as shown below.
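For reference, both Qt and standard C++11 can report the number of cores; a minimal sketch of what I would use as a starting point for the pool size:

#include <QThread>
#include <thread>

int qtCores = QThread::idealThreadCount();                // Qt's estimate of usable cores
unsigned stdCores = std::thread::hardware_concurrency();  // C++11 equivalent (may return 0 if unknown)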

While dozens of tutorials make it look easy to print some messages from a couple of (usually 2) threads, could anyone provide more detailed help on how to return vectors and unload them thread-safely into the main vector, and on how to create a thread pool so that no time is wasted?

Using Ubuntu 13.04, Qt and C++11, though it should not matter.

PascalVKooten
  • Why not use Intel Threading Building Blocks, which has parallel_for and an excellent thread pool (and also allows you to fine-tune the pool), instead of manually splitting the calculation? With Intel TBB you specify the number of threads in main and the library will try to adjust the pool so as to give the same workload to each thread – Vivian Miranda Aug 31 '13 at 21:43
  • @ViniciusMiranda I am not sure what this is. Is there any documentation, or might you try and give an answer? – PascalVKooten Aug 31 '13 at 22:36

3 Answers

4

First of all, writing a thread pool is hard. If you really want to learn how to write one, the book C++ Concurrency in Action by Anthony Williams teaches you how to accomplish that.

However, your case seems to be a situation where a simple parallel_for will fit perfectly, so I suggest using the Intel Threading Building Blocks library. The advantage of that library is that it has a very good thread pool and works quite nicely with C++11 features.

Example code:

#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/tbb_thread.h"
#include <vector>

int main() {
  tbb::task_scheduler_init init(tbb::tbb_thread::hardware_concurrency());
  std::vector<double> a(1000);
  std::vector<double> c(1000);
  std::vector<double> b(1000);

  std::fill(b.begin(), b.end(), 1);
  std::fill(c.begin(), c.end(), 1);

  auto f = [&](const tbb::blocked_range<size_t>& r) {
    for(size_t j=r.begin(); j!=r.end(); ++j) a[j] = b[j] + c[j];    
  };
  size_t hint_number_iterations_per_thread = 100;
  tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000, hint_number_iterations_per_thread), f);
  return 0;
}

Done! Intel TBB has a very good thread pool that will try to adjust the workload of each thread. As long as hint_number_iterations_per_thread is not a crazy number, it will be very close to the optimal solution.

By the way: Intel TBB is an open-source library that works with the majority of compilers!

Vivian Miranda
  • @Dualinity Now I included a complete example. Easier to adapt to your code – Vivian Miranda Sep 01 '13 at 00:37
  • Thanks a lot. This seems interesting. I couldn't get tbb to be installed yet (though I did "make" the source), Qt is not recognizing it yet. Also, it does seem like some complex code. Could you add what "blocked_range" and "hardware concurrency" are? – PascalVKooten Sep 01 '13 at 08:10
  • I got it installed using the simple `sudo apt-get install libtbb-dev` – PascalVKooten Sep 01 '13 at 08:52
  • Also, this code does not compile for me, `f` does not name a type. – PascalVKooten Sep 01 '13 at 09:31
  • I tested this program and it does compile. Caveats: the Ubuntu repo has a very old version of tbb, and I am not sure it is compatible with C++11. Also don't forget to compile with the C++11 flags in Qt (whatever they are). @RobbieE seems to be a good way for Qt. For C++ in general, however, I always recommend tbb because it is a very powerful library (the new graph capabilities are mind-blowing, for example!) :) . Run with gcc 4.6/4.7 and an up-to-date tbb to see the example working – Vivian Miranda Sep 01 '13 at 17:11
  • Pray tell, how is this remotely relevant to Qt? You should be providing an answer that uses Qt APIs. – Kuba hasn't forgotten Monica Sep 02 '13 at 17:38
  • @KubaOber - First, Intel TBB can be used with Qt. It is an interesting alternative, especially given how sophisticated/optimized and simple TBB's thread pool is (why can't Qt people use an external lib?). Second and more important, he mentions that he uses Qt but asks about C++ thread pools in general, and his example fits perfectly in a C++ situation that calls for parallel_for. Third, yes, there is QThreadPool, but it was far from clear in the other answer how to write a parallel_for with an automatic pool, which is what he really needs. – Vivian Miranda Sep 02 '13 at 18:41
  • Given the answers, he has two alternatives that he can test to see which one gives better performance. If the answer needs to be restricted to Qt only, he should have made this more explicit and tagged only Qt. – Vivian Miranda Sep 02 '13 at 18:44
  • I agree with Kuba here. In addition, you do not seem to show any of the integration that you claim can be done easily. – László Papp Apr 25 '14 at 17:05
1

You don't need to create anything. If you're using Qt, your problem has already been solved. You can derive a class from QRunnable and then pass it on to QThreadPool to be executed.

You can instruct QThreadPool on how many threads should run concurrently (any extras just wait in a queue until a slot opens up), but this shouldn't be necessary, since QThreadPool picks a default based on your architecture that is usually good enough.

QThreadPool

QRunnable
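A minimal sketch of how this could look for the chunked computation in the question (myfunction below is only a placeholder with the signature the question uses; it just returns its second argument, as in the question's comment). Each task reads from and writes to its own disjoint slice of the shared vectors through raw pointers, so no locking is needed, and the main thread waits for the pool to drain before plotting:

#include <QVector>
#include <QRunnable>
#include <QThreadPool>
#include <QtGlobal>
#include <algorithm>

// Placeholder for the asker's computation; assumed signature from the question.
QVector<double> myfunction(const QVector<double> &x, const QVector<double> &y, int n)
{
    Q_UNUSED(x);
    Q_UNUSED(n);
    return y; // "let it now just return tempy"
}

// One task per chunk: reads its slice of x, computes, writes its slice of y in place.
class ChunkTask : public QRunnable
{
public:
    ChunkTask(const double *xin, double *yout, int count)
        : m_xin(xin), m_yout(yout), m_count(count) {}

    void run() override
    {
        // Build the temporary chunk vectors that myfunction expects.
        QVector<double> tempx(m_count), tempy(m_count);
        std::copy(m_xin, m_xin + m_count, tempx.begin());
        std::copy(m_yout, m_yout + m_count, tempy.begin());

        QVector<double> result = myfunction(tempx, tempy, m_count);
        std::copy(result.begin(), result.end(), m_yout); // disjoint slice: no lock needed
    }

private:
    const double *m_xin;
    double *m_yout;
    int m_count;
};

// Enqueue every chunk (including the short last one) and wait for completion.
// QRunnable's autoDelete is on by default, so the pool deletes each task after run().
void computeAll(const QVector<double> &x, QVector<double> &y)
{
    const double *xin = x.constData();
    double *yout = y.data();  // detach once here, in the main thread
    for (int offset = 0; offset < x.size(); offset += 200)
    {
        const int count = qMin(200, x.size() - offset);
        QThreadPool::globalInstance()->start(new ChunkTask(xin + offset, yout + offset, count));
    }
    QThreadPool::globalInstance()->waitForDone();
}

In a real GUI application you would probably avoid blocking the main thread with waitForDone() and instead signal completion back to it, but the above shows the basic QRunnable/QThreadPool mechanics.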

RobbieE
0

Even simpler than creating a QThreadPool and extending QRunnable, you can use the QtConcurrent library. Specifically, use the QtConcurrent::mapped function, which takes a begin iterator, an end iterator, and a function (which can be a lambda), and internally handles the creation of, and execution on, a thread pool for you.

There are two variations: "mapped" returns a QFuture to the results but does not block the current thread, while "blockingMapped" directly returns a list of the results.

To square a large vector of integers, you could do the following:

#include <QtConcurrent/QtConcurrent>

std::vector<int> myInts = ....

QVector<int> result = QtConcurrent::blockingMapped<QVector<int>>(myInts.begin(), myInts.end(),
                                                                 [](int x) { return x * x; });
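And a minimal sketch of the non-blocking variant mentioned above, reusing the same hypothetical squaring example: QtConcurrent::mapped returns a QFuture immediately, and the results can be collected later, e.g. right before plotting:

QFuture<int> future = QtConcurrent::mapped(myInts.begin(), myInts.end(),
                                           [](int x) { return x * x; });
// ... do other work on the current thread ...
future.waitForFinished();               // block only when the results are actually needed
QList<int> results = future.results();  // all mapped values, in order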
Elliott