
My C++ code evaluates very large integrals on time-series data (t2 >> t1). The integrals are of fixed length and are currently stored in an [m x 2] array of doubles. Column 1 is time; column 2 is the signal being integrated. The code runs on a quad-core or eight-core machine.

For a machine with k cores, I want to:

  • Spin off k-1 worker threads (one for each of the remaining cores) to evaluate portions of the integral (trapezoidal integration) and return their results to the waiting master thread.
  • Achieve the above without deep-copying portions of the original array.
  • Use the C++11 `std::async` template for portability.

How can I achieve the above without hardcoding the number of available cores?

I am currently using VS 2012.

Update for Clarity:

For example, here's the rough pseudo-code:

data is [100000,2] double

result = MyIntegrator(data[1:50000,1:2]) + MyIntegrator(data[50001:100000, 1:2]); 

I need the MyIntegrator() functions to be evaluated in separate threads. The master thread waits for the two results.

bartonm
    are the separate calculations dependent on each other? – Stephan Dollberg Jan 28 '13 at 19:33
  • I think `std::async` is one abstraction level too far for this problem. If you want to control the number of worker threads, you might be better off spawning them manually with `std::thread`. – juanchopanza Jan 28 '13 at 19:40
    @juanchopanza I think it's the other way around, what he probably needs is more abstraction like `parallel_for`. Nevertheless, the question is too vague in its current form to answer anything. Also, I think that `std::async` is always better than `std::thread` because it offers added exception safety. – Stephan Dollberg Jan 28 '13 at 19:42
  • The integrals are single integrands and can be evaluated in a piece-wise fashion. – bartonm Jan 28 '13 at 20:07

5 Answers


What about `std::thread::hardware_concurrency()`?

milianw

Get the number of cores; usually this can be found with `std::thread::hardware_concurrency()`:

Returns the number of concurrent threads supported by the implementation. The value should be considered only a hint.

If this returns zero, you can fall back to OS-specific queries to find the number of cores.

You'll still need to do testing to determine if multithreading will even give you tangible benefits, remember not to optimize prematurely :)

Kyle C

Here is source code that does a multi-threaded integration of the problem.

#include <vector>
#include <memory>
#include <future>
#include <iterator>
#include <iostream>

struct sample {
  double duration;
  double value;
};
typedef std::pair<sample*, sample*> data_range;
sample* begin( data_range const& r ) { return r.first; }
sample* end( data_range const& r ) { return r.second; }

typedef std::unique_ptr< std::future< double > > todo_item;

double integrate( data_range r ) {
  double total = 0.;
  for( auto&& s:r ) {
    total += s.duration * s.value;
  }
  return total;
}

// std::launch::async forces each chunk onto its own thread; the default
// launch policy may defer execution until get(), serializing the work.
todo_item threaded_integration( data_range r ) {
  return todo_item( new std::future<double>( std::async( std::launch::async, integrate, r )) );
}
double integrate_over_threads( data_range r, std::size_t threads ) {
  if (threads > std::size_t(r.second-r.first))
    threads = r.second-r.first;
  if (threads == 0)
    threads = 1;
  sample* begin = r.first;
  sample* end = r.second;

  std::vector< todo_item > todo_list;

  sample* highwater = begin;

  while (highwater != end) {
    sample* new_highwater = (end-highwater)/threads+highwater;
    --threads;
    todo_item item = threaded_integration( data_range(highwater, new_highwater) );
    todo_list.push_back( std::move(item) );
    highwater = new_highwater;
  }
  double total = 0.;
  for (auto&& item: todo_list) {
    total += item->get();
  }
  return total;
}

sample data[5] = {
  {1., 1.},
  {1., 2.},
  {1., 3.},
  {1., 4.},
  {1., 5.},
};
int main() {
  using std::begin; using std::end;
  double result = integrate_over_threads( data_range( begin(data), end(data) ), 2 );
  std::cout << result << "\n";
}

It requires some modification to read data in exactly the format you specified.

But you can call it with std::thread::hardware_concurrency() as the number of threads, and it should work.

(In particular, to keep it simple, I have pairs of (duration, value) rather than (time, value), but that is just a minor detail).
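One possible adaptation (my assumption about the layout, not part of the answer above): convert each consecutive pair of (time, value) rows from the [m x 2] array into a (duration, value) sample, using the trapezoidal average of the two endpoint values so that summing duration * value reproduces the trapezoidal rule exactly:

```cpp
#include <vector>
#include <cstddef>

struct sample {
  double duration;
  double value;
};

// Build (duration, value) samples from an m x 2 array of (time, value) rows.
// Each sample spans rows i and i+1; value is the trapezoidal average of the
// two endpoint signal values.
std::vector<sample> to_samples(const double (*data)[2], std::size_t m) {
  std::vector<sample> out;
  if (m < 2) return out;
  out.reserve(m - 1);
  for (std::size_t i = 0; i + 1 < m; ++i) {
    sample s;
    s.duration = data[i + 1][0] - data[i][0];
    s.value    = 0.5 * (data[i][1] + data[i + 1][1]);
    out.push_back(s);
  }
  return out;
}
```

The resulting vector's `data()` pointer and size then give the `data_range` that `integrate_over_threads` expects, without copying the original array more than once.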

Yakk - Adam Nevraumont

You could oversubscribe and see if it hurts your performance. Split your array into small fixed-length intervals (each computable in one quantum, perhaps fitting in one cache page) and compare that in performance with splitting according to the number of CPUs.

Use `std::packaged_task` and pass it to an explicit thread, so that you're not at the mercy of `std::async`'s launch policy.

The next step would be introducing a thread pool, but that's more complicated.


You could accept a command-line parameter for the number of worker threads.

John