Is std::packaged_task really expensive?

Question

I am surprised by the results of the following code, built with gcc 4.7.2 on openSUSE Linux:

#include <cmath>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <future>

int main(void)
{
  const long N = 10*1000*1000;
  std::vector<double> array(N);
  for (auto& i : array)
    i = rand()/333.;

  std::chrono::time_point<std::chrono::system_clock> start, end;
  start = std::chrono::system_clock::now();
  for (auto& i : array)
    pow(i,i);
  end = std::chrono::system_clock::now();
  std::chrono::duration<double> elapsed_seconds = end-start;
  std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

  start = std::chrono::system_clock::now();
  for (auto& i : array)
    std::packaged_task<double(double,double)> myTask(pow);
  elapsed_seconds = std::chrono::system_clock::now()-start;
  std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

  start = std::chrono::system_clock::now();
  for (auto& i : array)
    std::packaged_task<double()> myTask(std::bind(pow,i,i));
  elapsed_seconds = std::chrono::system_clock::now()-start;
  std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

  return 0;
}

The results look like this (and are fairly consistent amongst runs):

elapsed time: 0.694315s
elapsed time: 6.49907s
elapsed time: 8.42619s

If I interpret the results correctly, just creating a std::packaged_task (not even executing it or storing its arguments yet) is already ten times more expensive than executing pow. Is that a valid conclusion?

Why is this so?

Is this perhaps gcc-specific?

Well, naturally, a packaged task contains synchronisation primitives which are expensive -- bus lock and pipeline flush are common consequences of low-level synchronisation primitives, and so single-threaded synchronised will always lose out against single-threaded unsynchronised. You have to actually be able to benefit from concurrent or parallel execution to make a concurrent solution a viable improvement. — Kerrek SB, Oct 05 '13 at 22:00
(You can build your own packaged task [with a promise](http://stackoverflow.com/q/11004273/596781). The promise alone contains some serious synchronisation mechanics.) — Kerrek SB, Oct 05 '13 at 22:01
@KerrekSB, I would be surprised if creating a packaged_task required that many uses of locking primitives. After all, there is no possible contention (yet). And creating that many mutexes takes just 0.09s, so unless you need to create ten mutexes per packaged_task, there's still a lot of room... — Klaas van Gend, Oct 05 '13 at 22:19
I would look into the implementation of `std::promise` in your library. I haven't looked myself, but I suspect that that's doing something non-trivial even upon initialization. — Kerrek SB, Oct 05 '13 at 22:20
Guessing from the implementation in libstdc++ and my own tests using g++ 4.8.1 (at the default -O0), a lot of time in the `packaged_task` case is spent doing roughly this: `make_shared< function >( bind(..) )`, followed by its destruction. — dyp, Oct 05 '13 at 23:01
At -O3, all the numbers are roughly the same for me. Actually, I'm surprised anything is executed at all for the loop code. — dyp, Oct 05 '13 at 23:05
The [std::packaged_task constructor](http://en.cppreference.com/w/cpp/thread/packaged_task/packaged_task) 1) creates a shared state (with a dynamic memory allocation) and 2) stores the callable as a std::function-like object in that shared state. Furthermore, `myTask` is destroyed immediately at the end of its scope, which 3) destroys the shared state (with a memory deallocation). That sequence of operations is more expensive than simply calling `pow`. — yohjp, Oct 06 '13 at 05:57

score 4 · Accepted Answer · answered May 25 '15 at 23:08

You are not timing the execution of a packaged_task, only its creation.

std::packaged_task<double(double,double)> myTask(pow);

This only creates myTask; it does not execute it. Ideally you would measure the call myTask(i, i) rather than the construction, which is what I did in the modified program at the end of this answer (I dropped the std::bind measurement).

Results are worse than what you measured:

timing raw
elapsed time: 0.578244s

timing ptask
elapsed time: 20.7379s

I guess packaged_tasks are not suitable for small, repeated tasks: the overhead is certainly greater than the task itself. My reading is that you should use them in multithreaded code, for tasks that take longer than the overhead of calling and synchronizing a packaged_task.

If you're not multithreading, I think there's no point in wrapping a function call in classes that are built for multithreading and carry synchronization primitives. Sadly, those are not free.

For the record, here's what I used:

#include <cmath>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <future>
#include <thread>

int main(void)
{
  const long N = 10*1000*1000;
  std::vector<double> array(N);
  for (auto& i : array)
    i = rand()/333.;

  std::cout << "timing raw" << std::endl;
  std::chrono::time_point<std::chrono::system_clock> start, end;
  start = std::chrono::system_clock::now();
  for (auto& i : array)
    pow(i,i);
  end = std::chrono::system_clock::now();
  std::chrono::duration<double> elapsed_seconds = end-start;
  std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n\n";

  std::cout << "timing ptask" << std::endl;
  start = std::chrono::system_clock::now();
  std::packaged_task<double(double,double)> myTask(pow);
  for (auto& i : array)
  {
      myTask(i, i);
      myTask.get_future().wait();
      myTask.reset();
  }
  elapsed_seconds = std::chrono::system_clock::now()-start;
  std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n\n";
  return 0;
}

Hi @Leonardo, thanks for your elaborate answer. I think you confirm the conclusion: creating a packaged_task is expensive, and using it to offload work to other cores is only feasible for large amounts of work. — Klaas van Gend, May 26 '15 at 07:46
