0

I read some code for threadpool in github.

here is a good one, which revised by me.

#pragma once
#include <vector>
#include <queue>
#include <memory>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <future>
#include <functional>
#include <stdexcept>

class ThreadPool {
public:
    ThreadPool(size_t);
    template<class F, class... Args>
    auto enqueue(F&& f, Args&&... args) -> std::future<typename std::result_of<F(Args...)>::type>;  // -> apparently define return type is future
    ~ThreadPool();
private:
    // need to keep track of threads so we can join them
    std::vector< std::thread > workers;
    // the task queue
    std::queue< std::function<void()> > tasks;
    
    // synchronization
    std::mutex queue_mutex;
    std::condition_variable condition;
    bool stop;
};
 
// the constructor just launches some amount of workers
inline ThreadPool::ThreadPool(size_t threads) : stop(false) {
  for(size_t i = 0;i<threads;++i)
    workers.emplace_back(  // thread variable, emplace back just pass in a function for thread to construct
      [this] {
        for(;;) {
          std::function<void()> task;
          {   
            std::unique_lock<std::mutex> lock(this->queue_mutex);
            this->condition.wait(lock, [this] { return this->stop || !this->tasks.empty(); }); 
            // wait stop true or task not empty
            if(this->stop && this->tasks.empty()) return;  // only on stop and empty task, perfect exit
            // if only stop, still will hanle left task
            task = std::move(this->tasks.front());  // get the first task
            this->tasks.pop();
          }   
          task();
        }   
      }); 
}

// add new work item to the pool
template<class F, class... Args>
auto ThreadPool::enqueue(F&& f, Args&&... args) -> std::future<typename std::result_of<F(Args...)>::type> {
  using return_type = typename std::result_of<F(Args...)>::type;

  auto task = std::make_shared< std::packaged_task<return_type()> >(std::bind(std::forward<F>(f), std::forward<Args>(args)...));     
  std::future<return_type> res = task->get_future();
  {
    std::unique_lock<std::mutex> lock(queue_mutex);
    // don't allow enqueueing after stopping the pool
    if(stop) throw std::runtime_error("enqueue on stopped ThreadPool");
    tasks.emplace([task](){ (*task)(); });
  }
  condition.notify_one();
  return res;
}

// the destructor joins all threads
inline ThreadPool::~ThreadPool() {
  {
    std::unique_lock<std::mutex> lock(queue_mutex);
    stop = true;
  }
  condition.notify_all();
  for(std::thread &worker: workers) worker.join();
}

i use it for speeding up my code. for example, i have 1000 independent tasks, which cost about 1s for each one.

I found use this threadpool, it's slower than just one loop.

so, i write some small code to check.

here is my code.

#include "./time_util.h" // this is a timer class, just use to get time
#include "./thread_pool.hpp"  //  this is the class Threadpool header

size_t i = 0;
void test() {
  // printf("hah\n");
  i++;
}

#define N 1000000
void fun1() {
  printf("func1\n");
  ThreadPool pool(10);
  for (int i = 0; i < N; ++i) {
    pool.enqueue(test);
  }
}

void fun2() {
  printf("func2\n");
  for (int i = 0; i < N; ++i) {
    test();
  }
}

int main(int argc, char** argv) {
  util::time_util::Timer t;
  t.StartTimer();
  if (argc == 1) {
    fun1();
  } else {
    fun2();
  }
  printf("i=%d\n", i); 
  t.EndTimer("1");
}

the loop version cost 0.005s threadpool cost 5s.

my machine 's cpu is 2.2G 24core(48thread), centos7.

and, the result of threadpool is incorrect(that is ok, since i dont add lock)

can you explain this? does this mean for small task, i cant use threadpool to speed up?

nick
  • 832
  • 3
  • 12
  • 3
    Because the amount of work being done in each thread is so small and the overhead of scheduling is much larger. Put some real work in the threadproc. Also `i++;` is UB and therefore invalidates all testing. – Richard Critten May 23 '21 at 12:48
  • if my independent task is only 1s for each, does this mean i cant speed up with threadpool? (assume i cant bind 10task as one) @RichardCritten – nick May 23 '21 at 12:53
  • As with all threads the OS has to context switch, you need to work-per-thread to be 1 or more orders of magnitude greater than the context switch time plus the overhead of the threadpool itself. The ideal use-case for threadpools is IO either disk or network. – Richard Critten May 23 '21 at 12:56
  • M experience shows that the minimum unit work to be done in a thread pooling so that it is bringing full advantage is actually quite high. In the order of the time to decompress a 512x512 JPEG image for example. – prapin May 23 '21 at 12:56
  • is there any good methods, if i want to speed up my small tasks group? (500 tasks, loop to run cost 80s.)@RichardCritten – nick May 23 '21 at 13:04
  • If you have 500 tasks and executing them serially takes 80 s (160 ms per task), it looks like this is indeed a good situation for parallelizing. But for that parallelization to work good, the task must work on *independent* data. That means, they access different memory locations, at least 64 bytes away each. Otherwise, there is false (or true) sharing (https://en.wikipedia.org/wiki/False_sharing) and performance might be worse than single threaded. – prapin May 23 '21 at 13:28
  • @prapin: Note that your previous comment does not apply to read-only access to shared data. It only applies to shared data that is written to by at least one thread. – Andreas Wenzel May 23 '21 at 14:39
  • @AndreasWenzel You are absolutely right. Read-only shared data is not a problem for performance. – prapin May 23 '21 at 14:44
  • @nick: if you want more help, please explain what each of your *task* does. I am pretty sure the problem comes from there, not in `ThreadPool` implementation nor from your use of it. Ideally, if it is short and sharable, you could post the code of that task. At minimum, explain in detail what it does. I would bet on a *false sharing* problem, but there might be other explanations (like a global mutex). – prapin May 23 '21 at 14:51

1 Answers1

2

A threadpool is a very "expensive" utility. It makes sense if you have multiple heavy computation tasks, so that the overhead introduced by threadpool becomes negligible.

In your case the computation of i++ is so trivial, you are basically only measuring the overhead of your threadpools. Try replacing it with a computation that takes longer, for instance with sleep(1), to see the difference.

On a sidenote: size_t i; and i++ are not thread safe, you might get wrong results. Use atomic<size_t> i; instead.


A word of caution: thread pools are useful for some tasks like handling workers in a webserver, not really for others tasks like implementing a sorting algorithm or summing up numbers. If you want to parallelize a computation, C++17 brings parallel execution policies to many standard algorithms where all thread handling is taken care of for you. See here https://en.cppreference.com/w/cpp/algorithm and here Parallel Loops in C++

Finally, never underestimate the sheer amount of power you can squeeze out of a single core on a modern CPU.

kritzikratzi
  • 19,662
  • 1
  • 29
  • 40
  • I am working on speed up my code, it run 500 dependent tasks. if i loop to run, it cost about 80s, but when i use threadpool, it is always about 100s, which confused me a lot. can you kindly offer some suggestion if i want to speed up this kind of tasks? – nick May 23 '21 at 13:02
  • if you use 500 tasks, and each does `sleep(1)` without threadpool, then it has to take 500s (not 80s). – kritzikratzi May 23 '21 at 13:12