Why Threadpool not speed up but slow a lot?

Question

I read some code for threadpool in github.

here is a good one, which revised by me.

#pragma once
#include <vector>
#include <queue>
#include <memory>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <future>
#include <functional>
#include <stdexcept>

class ThreadPool {
public:
    ThreadPool(size_t);
    template<class F, class... Args>
    auto enqueue(F&& f, Args&&... args) -> std::future<typename std::result_of<F(Args...)>::type>;  // -> apparently define return type is future
    ~ThreadPool();
private:
    // need to keep track of threads so we can join them
    std::vector< std::thread > workers;
    // the task queue
    std::queue< std::function<void()> > tasks;
    
    // synchronization
    std::mutex queue_mutex;
    std::condition_variable condition;
    bool stop;
};
 
// the constructor just launches some amount of workers
inline ThreadPool::ThreadPool(size_t threads) : stop(false) {
  for(size_t i = 0;i<threads;++i)
    workers.emplace_back(  // thread variable, emplace back just pass in a function for thread to construct
      [this] {
        for(;;) {
          std::function<void()> task;
          {   
            std::unique_lock<std::mutex> lock(this->queue_mutex);
            this->condition.wait(lock, [this] { return this->stop || !this->tasks.empty(); }); 
            // wait stop true or task not empty
            if(this->stop && this->tasks.empty()) return;  // only on stop and empty task, perfect exit
            // if only stop, still will hanle left task
            task = std::move(this->tasks.front());  // get the first task
            this->tasks.pop();
          }   
          task();
        }   
      }); 
}

// add new work item to the pool
template<class F, class... Args>
auto ThreadPool::enqueue(F&& f, Args&&... args) -> std::future<typename std::result_of<F(Args...)>::type> {
  using return_type = typename std::result_of<F(Args...)>::type;

  auto task = std::make_shared< std::packaged_task<return_type()> >(std::bind(std::forward<F>(f), std::forward<Args>(args)...));     
  std::future<return_type> res = task->get_future();
  {
    std::unique_lock<std::mutex> lock(queue_mutex);
    // don't allow enqueueing after stopping the pool
    if(stop) throw std::runtime_error("enqueue on stopped ThreadPool");
    tasks.emplace([task](){ (*task)(); });
  }
  condition.notify_one();
  return res;
}

// the destructor joins all threads
inline ThreadPool::~ThreadPool() {
  {
    std::unique_lock<std::mutex> lock(queue_mutex);
    stop = true;
  }
  condition.notify_all();
  for(std::thread &worker: workers) worker.join();
}

i use it for speeding up my code. for example, i have 1000 independent tasks, which cost about 1s for each one.

I found use this threadpool, it's slower than just one loop.

so, i write some small code to check.

here is my code.

#include "./time_util.h" // this is a timer class, just use to get time
#include "./thread_pool.hpp"  //  this is the class Threadpool header

size_t i = 0;
void test() {
  // printf("hah\n");
  i++;
}

#define N 1000000
void fun1() {
  printf("func1\n");
  ThreadPool pool(10);
  for (int i = 0; i < N; ++i) {
    pool.enqueue(test);
  }
}

void fun2() {
  printf("func2\n");
  for (int i = 0; i < N; ++i) {
    test();
  }
}

int main(int argc, char** argv) {
  util::time_util::Timer t;
  t.StartTimer();
  if (argc == 1) {
    fun1();
  } else {
    fun2();
  }
  printf("i=%d\n", i); 
  t.EndTimer("1");
}

the loop version cost 0.005s threadpool cost 5s.

my machine 's cpu is 2.2G 24core(48thread), centos7.

and, the result of threadpool is incorrect(that is ok, since i dont add lock)

can you explain this? does this mean for small task, i cant use threadpool to speed up?

Because the amount of work being done in each thread is so small and the overhead of scheduling is much larger. Put some real work in the threadproc. Also `i++;` is UB and therefore invalidates all testing. — Richard Critten, May 23 '21 at 12:48
if my independent task is only 1s for each, does this mean i cant speed up with threadpool? (assume i cant bind 10task as one) @RichardCritten — nick, May 23 '21 at 12:53
As with all threads the OS has to context switch, you need to work-per-thread to be 1 or more orders of magnitude greater than the context switch time plus the overhead of the threadpool itself. The ideal use-case for threadpools is IO either disk or network. — Richard Critten, May 23 '21 at 12:56
M experience shows that the minimum unit work to be done in a thread pooling so that it is bringing full advantage is actually quite high. In the order of the time to decompress a 512x512 JPEG image for example. — prapin, May 23 '21 at 12:56
is there any good methods, if i want to speed up my small tasks group? (500 tasks, loop to run cost 80s.)@RichardCritten — nick, May 23 '21 at 13:04
If you have 500 tasks and executing them serially takes 80 s (160 ms per task), it looks like this is indeed a good situation for parallelizing. But for that parallelization to work good, the task must work on *independent* data. That means, they access different memory locations, at least 64 bytes away each. Otherwise, there is false (or true) sharing (https://en.wikipedia.org/wiki/False_sharing) and performance might be worse than single threaded. — prapin, May 23 '21 at 13:28
@prapin: Note that your previous comment does not apply to read-only access to shared data. It only applies to shared data that is written to by at least one thread. — Andreas Wenzel, May 23 '21 at 14:39
@AndreasWenzel You are absolutely right. Read-only shared data is not a problem for performance. — prapin, May 23 '21 at 14:44
@nick: if you want more help, please explain what each of your *task* does. I am pretty sure the problem comes from there, not in `ThreadPool` implementation nor from your use of it. Ideally, if it is short and sharable, you could post the code of that task. At minimum, explain in detail what it does. I would bet on a *false sharing* problem, but there might be other explanations (like a global mutex). — prapin, May 23 '21 at 14:51

kritzikratzi · Answer 1 · 2021-05-23T13:58:02.993

A threadpool is a very "expensive" utility. It makes sense if you have multiple heavy computation tasks, so that the overhead introduced by threadpool becomes negligible.

In your case the computation of i++ is so trivial, you are basically only measuring the overhead of your threadpools. Try replacing it with a computation that takes longer, for instance with sleep(1), to see the difference.

On a sidenote: size_t i; and i++ are not thread safe, you might get wrong results. Use atomic<size_t> i; instead.

A word of caution: thread pools are useful for some tasks like handling workers in a webserver, not really for others tasks like implementing a sorting algorithm or summing up numbers. If you want to parallelize a computation, C++17 brings parallel execution policies to many standard algorithms where all thread handling is taken care of for you. See here https://en.cppreference.com/w/cpp/algorithm and here Parallel Loops in C++

Finally, never underestimate the sheer amount of power you can squeeze out of a single core on a modern CPU.

I am working on speed up my code, it run 500 dependent tasks. if i loop to run, it cost about 80s, but when i use threadpool, it is always about 100s, which confused me a lot. can you kindly offer some suggestion if i want to speed up this kind of tasks? — nick, May 23 '21 at 13:02
if you use 500 tasks, and each does `sleep(1)` without threadpool, then it has to take 500s (not 80s). — kritzikratzi, May 23 '21 at 13:12

Why Threadpool not speed up but slow a lot?

1 Answers1