
I am developing a program which continuously receives frames from a video stream and computes a Motion Estimated Value between each pair of frames.

Due to hardware limitations, I have to run the Motion Estimation (ME) algorithm on the CPU, which takes about 2 seconds per computation. Because of that, I want to implement the ME algorithm with multithreading. The idea would be to receive the next frames from the stream in a main thread while the motion value is being calculated in another thread.

I have done it using one thread per task, that is, every time a pair of frames is received I create a new thread to compute the motion value. However, because of the time each motion computation takes, a lot of threads are created and run concurrently, which I suppose is not very efficient.

I think the best way to reimplement this is with a thread pool: on the one hand, a main thread which receives the frames and stores them in a buffer or queue, and on the other hand, 4 or 8 threads running concurrently and reading from the reception buffer, which, if I am not wrong, should be protected by a mutex. However, the main thread would be receiving frames much faster than the motion computations finish, and I don't know how to manage that.
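In pseudocode, the structure I am imagining looks roughly like this (FrameQueue, MAX_PENDING, etc. are just placeholder names):

```text
main thread:
    loop:
        pair = receive_two_frames()
        lock(queue_mutex)
        push pair onto FrameQueue        # what if it grows too fast?
        unlock(queue_mutex)

worker threads (4 or 8 of them):
    loop:
        lock(queue_mutex)
        wait until FrameQueue not empty
        pair = pop from FrameQueue
        unlock(queue_mutex)
        motion_value = compute_ME(pair)
```

The part I don't know how to handle is what to do with FrameQueue when frames arrive faster than the workers can drain it.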

I am very new to C++ and new to threads, so I'd appreciate it if you could provide some solution in pseudocode just to start my reimplementation.

Thank you so much

anaVC94
  • Well, as you said, you could use a mutex! So every time you create a thread, lock the mutex, increment a ThreadCounter, then unlock the mutex. Only create a thread when the counter is less than the maximum number... – PhoenixBlue Aug 08 '19 at 13:04
  • 1
    What is your strategy for skipping frames? If you just blindly start processing the first few frames that come in, you will end up in a rhythm where you only cover the first 16 frames of every second, instead of evenly sampling throughout every second. – Botje Aug 08 '19 at 13:09
  • Are your incoming frames real-time, or decoded from a file? If they're real time, you are indeed going to fall way behind with a two second computation, and need some way to cope. If they're from a file, you can just stop reading when the queue is full, and wait until you catch up a bit. – Useless Aug 08 '19 at 15:33
  • @Useless They are incoming frames in real-time at 30fps, but it is true that we are not going to process all of them, the idea would be to get about 5-6 fps. Thanks for the answer – anaVC94 Aug 09 '19 at 08:09
  • OK, so you expect to need about 10-12 threads to produce 5-6 frames per second. Do you have that many free cores? – Useless Aug 09 '19 at 13:22

3 Answers


I would avoid using a thread pool in this case. From wikipedia (emphasis mine):

[A thread pool] increases performance and avoids latency in execution due to frequent creation and destruction of threads for short-lived tasks.

Your long-running computations dwarf the time it takes to create and destroy a thread, so creating a thread for each task seems reasonable to me. The more you can avoid mutexes and co., the better. As for running a lot of threads at once, the time it takes to switch between threads is also dwarfed by the computation time, so limiting the number of threads used would only give you a very small speedup [1].

Where you might have a problem is if your machine can't complete the computations quickly enough to keep up with incoming data. If all of your CPU cores are running at 100%, the only thing you can do is make your computations more efficient (maybe downsample your video frames?) or get more computing power.


They are incoming frames in real-time at 30fps.

[1] I should note that for real-time applications you should limit the number of threads used to the number of cores (or one or two higher; profile it). This will reduce the latency between receiving a frame and producing the result without affecting the overall performance.

joshwilsonvu
  • Depends what you call a "task." If processing one frame pair is a "task" then that's relatively short-lived. – Solomon Slow Aug 08 '19 at 13:23
  • 1
    I'm going off of "takes about 2 seconds per computation" which I took to mean processing one frame pair. Compare to the ~10 microsecond thread creation time according to https://stackoverflow.com/a/27764581/7619380. – joshwilsonvu Aug 08 '19 at 13:29
  • 1
    My problem is, I've written code for so many different systems--from big database servers to microcontrollers running at sub-MHz clock rates--that I no longer associate any particular numeric values with ideas like short/long, fast/slow, large/small. I assign abstract meanings to those concepts. When I hear "short-lived task," I understand that to mean, "does one thing, then ends." When I hear "long-running thread," I think, "Does an indefinite number of things,..." But yeah! You're right. If the program uses 2 seconds of CPU per frame, then it isn't going to run in real-time. – Solomon Slow Aug 08 '19 at 15:24
  • Thanks @JoshWilson. We are going to use a 16 GB RAM machine. A possible way to make the computation more efficient would be to compute the Motion Estimation on the GPU, not the CPU, because there it takes less than 0.2 seconds per computation. As I have said, I am working with limited resources. But thank you for your approach; I thought using a thread pool would definitely be better than my current implementation – anaVC94 Aug 09 '19 at 08:15

Rough feasibility estimate

... takes about 2 seconds per computation.

... having 4 or 8 threads running concurrently ...

... about 5-6 fps

Well those constraints obviously don't work.

Eight threads each producing 0.5 frames per second give you at best four frames per second.

If you need 6 frames per second, you need 12 threads. Furthermore, those threads actually need to be bound to real hardware cores.

Next, you need to describe your hardware platform. If it doesn't have at least 12 cores, you can't do what you're asking, at least in the way you suggest.

If it has 12 hyperthreaded "cores", that might not be sufficient either: one thread can probably saturate all your ALUs. You haven't said how big your frames are, but L1 cache pressure might also be a problem.

If you don't have that many cores, you either need to compute each frame faster, or compromise on output frames-per-second.

Implementation

You said you want to estimate motion between two successive frames. Does that mean successive input frames, or successive output frames?

The first case means you're sampling the input, reading two new frames for every output, which is more data but your threads can proceed in parallel:

Out0 = ME(In0, In1)

Out1 = ME(In6, In7)

(or ME(0,6), ME(6,12), ... or something).

The second case means you only need one input frame per output, but you can't start the second output frame until the first is completed (you're comparing the first output with the nth input frame):

Out0 = In0

Out1 = ME(Out0, In6)

Out2 = ME(Out1, In12)


tl;dr There are some basic things you need to clarify before you can really start coding anything.

Useless

You should take a look at the Intel TBB Task Scheduler. You'd make each computation a task (a derived class with an execute function) and let the scheduler schedule it on the available CPUs for you.

The key advantage of tasks versus logical threads is that tasks are much lighter weight than logical threads. On Linux systems, starting and terminating a task is about 18 times faster than starting and terminating a thread. On Windows systems, the ratio is more than 100. This is because a thread has its own copy of a lot of resources, such as register state and a stack. On Linux, a thread even has its own process id. A task in Intel® Threading Building Blocks, in contrast, is typically a small routine, and also, cannot be preempted at the task level (though its logical thread can be preempted).

The scheduler does load balancing. In addition to using the right number of threads, it is important to distribute work evenly across those threads. As long as you break your program into enough small tasks, the scheduler usually does a good job of assigning tasks to threads to balance load. With thread-based programming, you are often stuck dealing with load-balancing yourself, which can be tricky to get right.

Finally, the main advantage of using tasks instead of threads is that they let you think at a higher, task-based, level. With thread-based programming, you are forced to think at the low level of physical threads to get good efficiency, because you have one logical thread per physical thread to avoid undersubscription or oversubscription. You also have to deal with the relatively coarse grain of threads. With tasks, you can concentrate on the logical dependences between tasks, and leave the efficient scheduling to the scheduler.


Alternatively, if you cannot use the library, you could implement your own task scheduler based on those ideas. A simple implementation would be a multiple-producer-multiple-consumer queue serviced by a fixed number of long-lived threads in the pool (for a compute thread pool you don't want more threads than the number of available CPU cores). An idle thread waits on the queue, grabs a task when one becomes available, and executes it.

Maxim Egorushkin
  • You do want more threads than cores. In particular, you want the number to be tunable so that you can experiment with it. Even if the threads are purely CPU-bound, having N+1 is standard. If there's any blocking I/O, you would definitely want more threads than cores. – Steven Sudit Aug 09 '19 at 21:39