
My C++ code evaluates very large integrals on time-series data (t2 >> t1). The integrals are of fixed length and are currently stored in an [m x 2] array of doubles. Column 1 is time; column 2 is the signal being integrated. The code runs on a quad-core or eight-core machine.

For a machine with k cores, I want to:

  • Spin off k-1 worker threads (one for each of the remaining cores) to evaluate portions of the integral (trapezoidal integration) and return their results to the waiting master thread.
  • Achieve the above without deep-copying portions of the original array.
  • Use the C++11 `std::async` template for portability.

How can I achieve the above without hardcoding the number of available cores?

I am currently using VS 2012.

Update for Clarity:

For example, here's the rough pseudo-code:

data is [100000,2] double

result = MyIntegrator(data[1:50000,1:2]) + MyIntegrator(data[50001:100000, 1:2]); 

I need the MyIntegrator() functions to be evaluated in separate threads. The master thread waits for the two results.

bartonm
    are the separate calculations dependent on each other? – Stephan Dollberg Jan 28 '13 at 19:33
  • I think `std::async` is one abstraction level too far for this problem. If you want to control the number of worker threads, you might be better off spawning them manually with `std::thread`. – juanchopanza Jan 28 '13 at 19:40
    @juanchopanza I think it's the other way around, what he probably needs is more abstraction like `parallel_for`. Nevertheless, the question is too vague in its current form to answer anything. Also, I think that `std::async` is always better than `std::thread` because it offers added exception safety. – Stephan Dollberg Jan 28 '13 at 19:42
  • The integrals are single integrands and can be evaluated in a piece-wise fashion. – bartonm Jan 28 '13 at 20:07

5 Answers


What about `std::thread::hardware_concurrency()`?

milianw

Get the number of cores; usually this can be found with `std::thread::hardware_concurrency()`:

Returns the number of concurrent threads supported by the implementation. The value should be considered only a hint.

If this returns zero, you can fall back to OS-specific queries to find the number of cores.

You'll still need to do testing to determine if multithreading will even give you tangible benefits, remember not to optimize prematurely :)

Kyle C

Here is source code that does a multi-threaded integration of the problem.

#include <vector>
#include <memory>
#include <future>
#include <iterator>
#include <iostream>

struct sample {
  double duration;
  double value;
};
typedef std::pair<sample*, sample*> data_range;
sample* begin( data_range const& r ) { return r.first; }
sample* end( data_range const& r ) { return r.second; }

typedef std::unique_ptr< std::future< double > > todo_item;

double integrate( data_range r ) {
  double total = 0.;
  for( auto&& s:r ) {
    total += s.duration * s.value;
  }
  return total;
}

// std::launch::async forces each chunk onto its own thread; the default
// launch policy may defer execution until get(), serializing the work.
todo_item threaded_integration( data_range r ) {
  return todo_item( new std::future<double>( std::async( std::launch::async, integrate, r )) );
}
double integrate_over_threads( data_range r, std::size_t threads ) {
  if (threads > std::size_t(r.second-r.first))
    threads = r.second-r.first;
  if (threads == 0)
    threads = 1;
  sample* begin = r.first;
  sample* end = r.second;

  std::vector< todo_item > todo_list;

  sample* highwater = begin;

  while (highwater != end) {
    sample* new_highwater = (end-highwater)/threads+highwater;
    --threads;
    todo_item item = threaded_integration( data_range(highwater, new_highwater) );
    todo_list.push_back( std::move(item) );
    highwater = new_highwater;
  }
  double total = 0.;
  for (auto&& item: todo_list) {
    total += item->get();
  }
  return total;
}

sample data[5] = {
  {1., 1.},
  {1., 2.},
  {1., 3.},
  {1., 4.},
  {1., 5.},
};
int main() {
  using std::begin; using std::end;
  double result = integrate_over_threads( data_range( begin(data), end(data) ), 2 );
  std::cout << result << "\n";
}

It requires some modification to read data in exactly the format you specified.

But you can call it with std::thread::hardware_concurrency() as the number of threads, and it should work.

(In particular, to keep it simple, I have pairs of (duration, value) rather than (time, value), but that is just a minor detail).
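One possible adaptation (my assumption about the layout, not part of the answer above): convert each consecutive pair of (time, value) rows from the [m x 2] array into a (duration, value) sample, using the trapezoidal average of the two endpoint values so that summing duration * value reproduces the trapezoidal rule exactly:

```cpp
#include <vector>
#include <cstddef>

struct sample {
  double duration;
  double value;
};

// Build (duration, value) samples from an m x 2 array of (time, value) rows.
// Each sample spans rows i and i+1; value is the trapezoidal average of the
// two endpoint signal values.
std::vector<sample> to_samples(const double (*data)[2], std::size_t m) {
  std::vector<sample> out;
  if (m < 2) return out;
  out.reserve(m - 1);
  for (std::size_t i = 0; i + 1 < m; ++i) {
    sample s;
    s.duration = data[i + 1][0] - data[i][0];
    s.value    = 0.5 * (data[i][1] + data[i + 1][1]);
    out.push_back(s);
  }
  return out;
}
```

The resulting vector's `data()` pointer and size then give the `data_range` that `integrate_over_threads` expects, without copying the original array more than once.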

Yakk - Adam Nevraumont

You could oversubscribe and see if it hurts your performance. Split your array into small fixed-length intervals (each computable in one quantum, perhaps fitting in one cache page) and compare that in performance with splitting according to the number of CPUs.

Use `std::packaged_task` and pass it to an explicit thread, so that you're not at the mercy of `std::async`'s launch policy.

The next step would be introducing a thread pool, but that's more complicated.


You could accept a command-line parameter for the number of worker threads.

John