
Background:

I read some articles and posts regarding Multithreading in OpenCV:

  • On the one hand you can build OpenCV with TBB or OpenMP support which parallelize OpenCV's functions internally.
  • On the other hand you can create multiple threads yourself and call the functions parallel to realize multithreading on application level.

But I couldn't find a consistent answer as to which method of multithreading is the right way to go.

Regarding TBB, an answer from 2012 with 5 upvotes:

With WITH_TBB=ON OpenCV tries to use several threads for some functions. The problem is that just a handsome of function are threaded with TBB at the moment (may be a dozen). So, it is hard to see any speedup. OpenCV philosophy here is that application should be multi-threaded, not OpenCV functions.[...]

Regarding multithreading on application level, a comment from a moderator on answers.opencv.org:

please avoid using your own multithreading with opencv. a lot of functions are explicitly not thread-safe. rather rebuild the opencv libs with TBB or openmp support.

But another answer with 3 upvotes states:

The library itself is thread safe in that you can have multiple calls into the library at the same time, however the data is not always thread safe.

Problem Description:

So I thought it was at least okay to use (multi)threading on the application level. But I encountered strange performance problems when running my program over longer periods of time.

After investigating these performance problems I created this minimal, complete, and verifiable example code:

#include "opencv2/opencv.hpp"
#include <vector>
#include <chrono>
#include <thread>

using namespace cv;
using namespace std;
using namespace std::chrono;

void blurSlowdown(void*) {
    Mat m1(360, 640, CV_8UC3);
    Mat m2(360, 640, CV_8UC3);
    medianBlur(m1, m2, 3);
}

int main()
{
    for (;;) {
        high_resolution_clock::time_point start = high_resolution_clock::now();

        for (int k = 0; k < 100; k++) {
            thread t(blurSlowdown, nullptr);
            t.join(); //INTENTIONALLY PUT HERE READ PROBLEM DESCRIPTION
        }

        high_resolution_clock::time_point end = high_resolution_clock::now();
        cout << duration_cast<microseconds>(end - start).count() << endl;
    }
}

Actual Behavior:

If the program is running for an extended period of time the time spans printed by

cout << duration_cast<microseconds>(end - start).count() << endl;

are getting larger and larger.

After running the program for around 10 minutes the printed timespans have doubled, which is not explainable with normal fluctuations.

Expected Behavior:

The behavior I would expect is that the time spans stay pretty much constant, even though they might be longer than calling the function directly.

Notes:

When calling the function directly:

[...]
for (int k = 0; k < 100; k++) {
    blurSlowdown(nullptr);
}
[...]

The printed time spans are staying constant.

When not calling the cv function:

void blurSlowdown(void*) {
    Mat m1(360, 640, CV_8UC3);
    Mat m2(360, 640, CV_8UC3);
    //medianBlur(m1, m2, 3);
}

The printed time spans are staying constant too. So there must be something wrong when using threading in combination with OpenCV functions.

  • I know that the code above does NOT achieve actual multithreading; only one thread is active at a time, calling the blurSlowdown() function.
  • I know that creating threads and cleaning them up afterwards does not come for free and will be slower than calling the function directly.
  • It is NOT about the code being slow in general. The problem is that the printed time spans are getting longer and longer over time.
  • The problem is not related to the medianBlur() function, since it happens with other functions like erode() or blur() too.
  • The problem was reproduced on a Mac under clang++; see the comment by @Mark Setchell.
  • The problem is amplified when using the debug library instead of the release library.

My testing environment:

  • Windows 10 64bit
  • MSVC compiler
  • Official OpenCV 3.4.2 binaries

My Questions:

  • Is it okay to use (multi)threading on application level with OpenCV?
  • If yes, why are the time spans printed by my program above GROWING over time?
  • If no, why is OpenCV then considered thread-safe, and how should the statement from Kirill Kornyakov be interpreted instead?
  • Is TBB / OpenMP in 2019 now widely supported?
  • If yes, what offers better performance, multithreading on application level(if allowed) or TBB / OpenMP?
Crigges
  • Joining the thread inside the loop effectively serializes them. You should use a `std::vector` of threads instead, fill it in the loop, and join all of them outside the loop. Or use `std::future` and `std::promise`. – πάντα ῥεῖ Jan 31 '19 at 17:58
  • No, this is exactly what I want to demonstrate. Please read the whole problem description. Why are the time spans printed by cout << duration_cast<microseconds>(end - start).count() << endl; getting longer and longer? They should be pretty much constant. – Crigges Jan 31 '19 at 18:01
  • Your code slows down because you have a double for-loop creating 100 threads on each iteration of the outer one... which seems to run infinitely lol.. All threads aren't created equal. What makes you think the time should be the same? – Brandon Jan 31 '19 at 18:09
  • @Crigges; If the threads are joined immediately.. what is the point of creating a thread in the first place?! You are doing multi-threading wrong and trying to force StackOverflow to answer a question that is inherently wrong.. Again: "Not all threads are created equally or in a timely fashion".. "What gives you the idea that the times/duration of every thread for 100 loops will be the same/similar?".. What makes you think that one thread will be created in 0.5ms and the other will be created in 0.5ms instead of 1s for example.. – Brandon Jan 31 '19 at 18:14
  • @Brandon This is the problem. If I don't create the threads and call the blur function directly, the printed time spans stay constant, BUT with the threads the printed time spans get larger and larger. Why? It is NOT about the threaded version taking longer to execute. I know that creating threads and cleaning them up takes way more time than just calling a function. It is about the SAME code being executed over and over again, yet every time it executes it takes more time. Why? – Crigges Jan 31 '19 at 18:18
  • Interesting, I can reproduce this as well; the time seems to grow linearly (it took about 6000 iterations of the outer loop to double). The memory also seems to grow over time; it steadily went from ~7MB to ~40MB. Some more detailed profiling might be necessary. I have some suspicions regarding spawning so many short-lived threads... there might be some increasing overhead there (personally I prefer to keep a few long-lived threads). – Dan Mašek Jan 31 '19 at 18:29
  • @DanMašek Thank you for being able to reproduce this, really! The time stays constant when not calling `medianBlur` and just allocating the mats, so I think the threads alone are not the problem. Additionally, the time spans grow way faster when using the debug binaries instead of the release binaries. – Crigges Jan 31 '19 at 18:35
  • @Crigges; Hmm weird.. I can't reproduce it on Clang on MacOS. I will try Windows later. – Brandon Jan 31 '19 at 19:34
  • @Brandon How long did you let it run? For me it takes like 10min to double the time. – Crigges Jan 31 '19 at 19:54
  • Some more observations: it happens with both vs12 and vs14 (64bit). `boost::thread` does it too. OpenCV 3.1.0 does it too. Other CV functions (`blur`, `erode`) do it too. It looks like the most substantial growth is between the moment `blurSlowdown` returns and the moment the `join` returns. The time between construction of the `thread` object and `blurSlowdown` being called grows a little; the execution time of `blurSlowdown` seems to remain constant. – Dan Mašek Jan 31 '19 at 20:19
  • It might be that `cv::medianBlur` leaks memory. Looking at [this](https://github.com/opencv/opencv/issues/11449) issue, `cv::GaussianBlur` had a memory leak as well, so it might be related. I would open an issue if I were you. – serkan.tuerker Jan 31 '19 at 22:27
  • @SerkanT. The slowdown only happens when using threads, and it happens with other CV functions too; see Dan Mašek's comments for reference. And I already got this feedback: "please avoid using your own multithreading with opencv. a lot of functions are explicitly not thread-safe." This is why I thought maybe I am just doing something wrong and TBB is the way to go in 2019. – Crigges Jan 31 '19 at 22:51
  • It leaks memory on a Mac under clang++ too. If you comment out the `medianBlur()` though it stops leaking. So it appears to be that rather than the threading framework. – Mark Setchell Feb 02 '19 at 11:08
  • @MarkSetchell But it does not leak memory / get slower if you call the function directly; it only happens when you combine threading with calls into the OpenCV library. – Crigges Feb 02 '19 at 11:33
  • Could be related (and related issues): [Memory leak in every thread #9745](https://github.com/opencv/opencv/issues/9745). It looks like the recommended way to properly use multithreading is to use a thread pool. – Catree Feb 05 '19 at 09:07
  • @Catree Yes, it looks like the same problem. I was just focusing on the performance aspect of the problem instead of taking the memory leak into account, which is why I didn't find this issue. It is already really helpful to explain what is going on. However, I don't fully understand why the performance is affected that much by such a small memory leak. – Crigges Feb 05 '19 at 10:54
  • @Crigges The performance aspect is most likely due to the system having to create and kill so many threads. std::thread is a stand in for the OS threading model. Creating a thread has its own costs. – TinfoilPancakes Feb 06 '19 at 21:31
  • @TinfoilPancakes please read my notes: "I know that creating threads and cleaning them up afterwards does not come for free and will be slower than calling the function directly" ... "It is NOT about the code being slow in general. The problem is that the printed time spans are getting longer and longer over time." – Crigges Feb 06 '19 at 21:40
  • @Crigges I know, but the time spans increasing is most likely a result of the OS implementation. I'll try to replicate it on my systems and see if it happens. – TinfoilPancakes Feb 06 '19 at 21:44
  • @TinfoilPancakes if you comment out the OpenCV function, the time spans stay constant, even though the threads are created. – Crigges Feb 06 '19 at 22:00
  • Okay, that was my two cents. Generally though, it still is thread-safe, although I'll refrain from proving it. If you are not sure about your copy, build it yourself and take a look at the "ENABLE_IMPL_COLLECTION" and "ENABLE_INSTRUMENTATION" CMake options and the thread library. Also, download the current 3.4 line. Cheers – mainactual Feb 08 '19 at 17:54
  • @mainactual I am sad you deleted your answer. I think it was good and cleared up a lot of questions. With some more detail regarding the "growing time spans" issue I would have awarded the bounty to you. What I didn't understand is that even though we both had the "Concurrency" parallel framework, you weren't able to reproduce the increasing time spans issue. – Crigges Feb 08 '19 at 18:17
  • Are you sure that OpenCV doesn't have CUDA/OpenCL enabled and doesn't create new variables for each new thread? – huseyin tugrul buyukisik Feb 10 '19 at 21:08
  • @huseyintugrulbuyukisik I tested it with cv::ocl::setUseOpenCL(false); but the time spans are still increasing. – Crigges Feb 10 '19 at 21:21
  • I had a deeper look at this problem, since frankly I'm deploying OpenCV similarly and over long periods. It seems to be the innocent-looking `CV_OCL_RUN` macro in the body of many elementary functions, which --- even with `cv::ocl::useOpenCL` set to `false` --- sets up the `TLSData` (supposedly) to receive runtime-linked OpenCL function addresses. Luckily the `WITH_OPENCL` CMake option sets this macro to 0, so it essentially solves the problem. The problem in my case being the leaking memory with `std::thread`. – mainactual Feb 14 '19 at 10:38
  • @mainactual I did a rebuild with `WITH_OPENCL` disabled in cmake. The `ocl::haveOpenCL()` function is returning false but the program is still slowing down over time. Is my build invalid or is something else causing the problem? – Crigges Apr 11 '19 at 13:26
  • Which VS version are you using? VS2017 (incl. the Community edition) has a nicely improved profiler, and by comparison VS2013, as mentioned in the comments, is starting to be quite old. If it is thread-local-storage related, it will show up nicely in memory snapshots. After this question, frankly, I started using Concurrency exclusively, but of course I'd be interested to know the reason. – mainactual Apr 11 '19 at 18:34

1 Answer


First of all, thank you for the clarity of the question.

Q: Is it okay to use (multi)threading on application level with OpenCV?

A: Yes, it is totally okay to use multithreading on the application level with OpenCV, as long as the functions you call can take advantage of it, such as blurring or colour space conversion: you can split the image into multiple parts, apply the function to each part in parallel, and then recombine the parts to produce the final output.

Some functions, such as Hough transforms or pca_analysis, cannot give correct results when they are applied to divided image sections and then recombined. Applying application-level multithreading to such functions may produce incorrect results, so it should not be done.

As πάντα ῥεῖ mentioned, your implementation of multithreading gives you no advantage because you join each thread inside the for loop itself. I would suggest you use promise and future objects (if you want an example of how to do that, let me know in the comments and I will share a snippet).

The answer below took a lot of research; thanks for asking the question, it really helped me add to my multithreading knowledge :)

Q: If yes, why are the time spans printed by my program above GROWING over time?

A: After a lot of research I found out that creating and destroying threads takes a lot of CPU and memory resources. When we initialize a thread (in your code, the line thread t(blurSlowdown, nullptr);), an identifier is written to the memory location to which this variable points, and this identifier enables us to refer to the thread. Your program creates and destroys threads at a very high rate, and there is a thread pool allocated to the program through which it can run and destroy threads. I will keep it short; let's look at the explanation below:

  1. When you create a thread, an identifier is created which points to this thread.
  2. When you destroy the thread, this memory is freed.

BUT

  1. When you create another thread shortly after the first one is destroyed, the identifier of this new thread points to a new location in the thread pool (a location other than the previous thread's).

  2. After repeatedly creating and destroying threads, the thread pool is exhausted, and the CPU is forced to slow our program's cycles down a bit so that the thread pool can be freed again to make space for new threads.

Intel TBB and OpenMP are very good at thread pool management so this problem may not occur while using them.

Q: Is TBB in 2019 now widely supported?

A: Yes, you can take advantage of TBB in your OpenCV program by turning TBB support on when building OpenCV.

Here is a program for TBB implementation in medianBlur:

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include <iostream>
#include <chrono>

using namespace cv;
using namespace std;
using namespace std::chrono;

class Parallel_process : public cv::ParallelLoopBody
{

private:
    cv::Mat img;
    cv::Mat& retVal;
    int size;
    int diff;

public:
    Parallel_process(cv::Mat inputImage, cv::Mat& outImage,
                     int sizeVal, int diffVal)
        : img(inputImage), retVal(outImage),
          size(sizeVal), diff(diffVal)
    {
    }

    virtual void operator()(const cv::Range& range) const
    {
        for(int i = range.start; i < range.end; i++)
        {
            /* divide image in 'diff' number
               of parts and process simultaneously */

            cv::Mat in(img, cv::Rect(0, (img.rows/diff)*i,
                                     img.cols, img.rows/diff));
            cv::Mat out(retVal, cv::Rect(0, (retVal.rows/diff)*i,
                                         retVal.cols, retVal.rows/diff));

            cv::medianBlur(in, out, size);
        }
    }
};

int main()
{
    VideoCapture cap(0);

    cv::Mat img, out;

    while(1)
    {
        if (!cap.read(img))
            break; // stop if no frame could be grabbed
        out = cv::Mat::zeros(img.size(), CV_8UC3);

        // create 8 threads and use TBB
        auto start1 = high_resolution_clock::now();
        cv::parallel_for_(cv::Range(0, 8), Parallel_process(img, out, 9, 8));
        //cv::medianBlur(img, out, 9); //Uncomment to compare time w/o TBB
        auto stop1 = high_resolution_clock::now();
        auto duration1 = duration_cast<microseconds>(stop1 - start1);

        auto time_taken1 = duration1.count()/1000;
        cout << "TBB Time: " <<  time_taken1 << "ms" << endl;

        cv::imshow("image", img);
        cv::imshow("blur", out);
        cv::waitKey(1);
    }

    return 0;
}

On my machine, TBB implementation takes around 10ms and w/o TBB it takes around 40ms.

Q: If yes, what offers better performance, multithreading on the application level(if allowed) or TBB / OpenMP?

A: I would suggest using TBB/OpenMP over raw multithreading (pthread/std::thread), because TBB offers you better control over threads plus a better structure for writing parallel code, and it manages the underlying threads internally. If you use pthreads, you have to take care of synchronization, safety, etc. in your code yourself. Using these frameworks abstracts away the thread handling, which can get very complex.

Edit: I checked the comments regarding the incompatibility of the image dimensions with the number of threads across which you want to divide the processing. So here is a potential workaround (I haven't tested it, but it should work): scale the image resolution to compatible dimensions, like:

If your image is 485 x 647, scale it to 488 x 648, pass it to Parallel_process, and then scale the output back to the original size of 485 x 647.

For a comparison of TBB and OpenMP, check this answer

FutureJJ
  • Edit: Answered the question of growing time for threaded execution. – FutureJJ Mar 20 '19 at 14:40
  • Great answer, thanks. Does TBB automatically use thread pools, or do I somehow have to initialize them manually beforehand? I am experiencing the same behaviour as in the question; however, at no point in my program am I manually creating threads. I only use `cv::Mat::forEach` or `tbb::parallel_for` directly. I did not further investigate it because the growth is minimal (from 12 to 14 seconds after ~300 consecutive runs in my case). If not, I might want to dig deeper into this some day. :) – Carsten Mar 23 '19 at 13:25
  • You may want to check Intel [Software Development Zone forum](https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/302637), for an elaborate answer to whether TBB automatically uses thread pools. – FutureJJ Mar 23 '19 at 20:21
  • The time growth to 14s shouldn't have happened while using TBB parallel_for; can you please share your program? I may be able to solve it if there is an issue. – FutureJJ Mar 23 '19 at 20:25
  • It's more like a set of tools and libraries, I'm currently developing as a research project. I don't want to bother you with the details, since it's rather complex. I've instead created a [minimal example](https://git.io/fjJj2). Using it, I could not replicate the behavior, so it seems the issue originates from some other side effect. As I said, I did not further investigate this yet, since the actual growth is so minimal. Thanks anyway. :-) – Carsten Mar 25 '19 at 12:24
  • Thank you for your answer. This does not explain why the time spans stay constant when not calling a cv function inside the thread. If thread pool exhaustion were the problem, the time spans should grow in that case too. See the second note in my question. – Crigges Apr 06 '19 at 06:13
  • This will require some research into what those OpenCV functions actually do; I will get back to you as soon as I find enough information to answer. In the meantime, I suggest you add this question about the constant time to the "My Questions" section of your post; it really didn't catch my attention. – FutureJJ Apr 11 '19 at 12:41
  • A couple of notes though: if the image height is not divisible by 8 in this example, you will miss some scanlines due to rounding. Also, `cv::medianBlur` uses `BORDER_REPLICATE` at each seam, which causes ... well, seams in the image. Other filters may use other border modes, but all in all, you need to expand each region to process and then crop the requested region into `retVal` to do it correctly. – mainactual Apr 11 '19 at 19:50