2

I have folder of images and I perform sequence of some basic operations on them:

  1. Load source image.
  2. Perform some image processing on image.
  3. Save result image.

So I want to process each image in separate thread to speedup processing.

Here is my example code:

ThreadExample.h

#include <thread>


    class ThreadProcessing
    {
         static unsigned int concurentThreadsSupported;

         static void ImageProcessingFunction(const std::string &input_dir, const std::string &filename);
    public:
         void PrintNumberOfCPU();
         void MultithreadingProcessing(const std::string &dir, int N);
         void SingleThreadProcessing(const std::string &dir);
    };

ThreadExample.cpp

    #include "ThreadExample.h"

    unsigned int ThreadProcessing::concurentThreadsSupported = std::thread::hardware_concurrency();

    using namespace std;

    void ThreadProcessing::PrintNumberOfCPU()
    {
        cout << "Number of CPU : " << concurentThreadsSupported << endl;
    }

void ThreadProcessing::ImageProcessingFunction(const string &input_dir, const string &filename)
    {
        Mat src= imread(input_dir+"/"+filename);
        Mat dst;

        for(int i=0; i<10; ++i)
        {
            medianBlur(src, dst, 71);
        }

        boost::filesystem::path name= path(filename).stem();

        string output_filename= (input_dir/name).string()+"_output.png";
        imwrite(output_filename, dst);
    }

    void ThreadProcessing::SingleThreadProcessing(const string &dir)
    {
        time_t SingleThreadProcessingTime = clock();

        vector<string> imageNames= GetAllFilenamesInDir(dir, ".jpg");

        for(int i=0; i<(int)imageNames.size(); ++i)
        {
            ImageProcessingFunction(dir, imageNames[i]);
        }

        SingleThreadProcessingTime = clock() - SingleThreadProcessingTime;
        cout << "SingleThreadProcessingTime : " << (float(SingleThreadProcessingTime) / CLOCKS_PER_SEC) << endl;
    }

    void ThreadProcessing::MultithreadingProcessing(const string &dir, int N)
    {
        time_t MultithreadingProcessingTime = clock();

        std::thread threads[N];

        bool isAllImageProcessed= false;
        vector<string> imageNames= GetAllFilenamesInDir(dir, ".jpg");

        for(int i=0; i<(int)imageNames.size();)
        {
            //Launch a group of threads
            for(int k= 0; k< N; ++k)
            {
                threads[k] = std::thread(ImageProcessingFunction, dir, imageNames[i]);

                i++;

                if(i>=(int)imageNames.size())
                {
                    N= k+1;
                    isAllImageProcessed= true;
                    break;
                }
            }

            //Join the threads with the main thread
            for(int k= 0; k< N; ++k)
            {
                threads[k].join();
            }

            if(isAllImageProcessed)
                break;
        }

        MultithreadingProcessingTime = clock() - MultithreadingProcessingTime;
        cout << "MultithreadingProcessingTime : " << (float(MultithreadingProcessingTime) / CLOCKS_PER_SEC) << endl;
    }

main.cpp

int main(int argc, char** argv)
{
    ThreadProcessing threadProcessing;
    threadProcessing.PrintNumberOfCPU();
    threadProcessing.SingleThreadProcessing("/home/user/Images");
    threadProcessing.MultithreadingProcessing("/home/user/Images", 1);
    cout << "Done." << endl;
    return 0;
}

But it seems there is no speed improvement:

When I use 1 thread, output is:

Number of CPU : 8
SingleThreadProcessingTime : 6.54173
MultithreadingProcessingTime : 6.73393
Done.

When I use 4 threads, output is:

Number of CPU : 8
SingleThreadProcessingTime : 6.39089
MultithreadingProcessingTime : 8.3365
Done.

Is there is something wrong with my code or something conceptually wrong?

Update:

Also I tried 2 variants:

  1. 1 thread per image - seems this approach can hit max number of threads OS limit? And also inefficient?

Code:

void ThreadProcessing::SingleThreadForEachImage(const string &dir)
{
    time_t SingleThreadForEachImageTime = clock();

    vector<string> imageNames= GetAllFilenamesInDir(dir, ".jpg");

    int N= imageNames.size();
    std::thread threads[imageNames.size()];

    for(int i=0; i<N; ++i)
    {

        threads[i] = std::thread(ImageProcessingFunction, dir, imageNames[i]);
    }

    for(int i=0; i<N; ++i)
    {
        threads[i].join();
    }

    SingleThreadForEachImageTime = clock() - SingleThreadForEachImageTime;
    cout << "SingleThreadForEachImageTime : " << (float(SingleThreadForEachImageTime) / CLOCKS_PER_SEC) << endl;
}
  1. Split images to N chunks and process each chunk in separate thread.

Code:

vector<vector<string>> ThreadProcessing::SplitNamesVector(const vector<string> &imageNames, int N)
{
    vector<vector<string>> imageNameChunks;

    int K=0; //Number images in chunk
    if(imageNames.size()%N==0)
        K= imageNames.size()/N;
    else
        K= imageNames.size()/N+1;

    vector<string> chunk;
    for(int i=0; i<(int)imageNames.size(); ++i)
    {
        chunk.push_back(imageNames[i]);
        if(i%K==0 && i!=0)
        {
            imageNameChunks.push_back(chunk);
            chunk.clear();
        }
    }

    if(chunk.size()!=0)
        imageNameChunks.push_back(chunk);

    assert((int)imageNameChunks.size()==N);

    return imageNameChunks;
}

void ThreadProcessing::EachThreadProcessChunkOfImages(const std::string &dir, int N)
{
    time_t EachThreadProcessChunkOfImagesTime = clock();

    N= std::min(N, (int)concurentThreadsSupported);
    std::thread threads[N];

    vector<string> imageNames= GetAllFilenamesInDir(dir, ".jpg");
    vector<vector<string>> imageNameChunks= SplitNamesVector(imageNames, N);

    //Launch a group of threads
    for(int k= 0; k< N; ++k)
    {
        threads[k] = std::thread(ImageProcessingFunctionChunk, dir, imageNameChunks[k]);
    }

    for(int k= 0; k< N; ++k)
    {
        threads[k].join();
    }

    EachThreadProcessChunkOfImagesTime = clock() - EachThreadProcessChunkOfImagesTime;
    cout << "EachThreadProcessChunkOfImagesTime : " << (float(EachThreadProcessChunkOfImagesTime) / CLOCKS_PER_SEC) << endl;
}

Here is results (MultithreadingProcessing and EachThreadProcessChunkOfImages use 4 threads):

SingleThreadProcessingTime : 13.552
MultithreadingProcessingTime : 15.581
SingleThreadForEachImageTime : 26.7727
EachThreadProcessChunkOfImagesTime : 15.9078

UPDATE 2:

I also make a test without IO opeartions, only image processing.

Code:

void ThreadProcessing::ImageProcessingFunction(const cv::Mat &img)
{
    Mat dst;

    for(int i=0; i<10; ++i)
    {
        medianBlur(img, dst, 71);
    }
}
    vector<Mat> ThreadProcessing::LoadBatchOfImages(const std::string &dir, int batchSize)
    {
        vector<string> imageNames= GetAllFilenamesInDir(dir, ".jpg");

        vector<Mat> imageVec;
        for(int i=0; i<N; ++i)
        {
            string filename= dir+"/"+imageNames[i];
            Mat img= imread(filename);
            imageVec.push_back(img);
        }

        return imageVec;
    }

    void ThreadProcessing::OnlyProcessingTimeSequential(const std::string &dir, int batchSize)
    {
        //Load batch of images
        vector<Mat> imageVec= LoadBatchOfImages(dir, batchSize);

        assert((int)imageVec.size() == batchSize);
        cout << "imageVec.size() : " << imageVec.size() << endl;

        time_t OnlyProcessingTimeSequentialTime = clock();

        for(int i=0; i<batchSize; ++i)
        {
            ImageProcessingFunction(imageVec[i]);
        }

        OnlyProcessingTimeSequentialTime = clock() - OnlyProcessingTimeSequentialTime;
        cout << "OnlyProcessingTimeSequentialTime : " << (float(OnlyProcessingTimeSequentialTime) / CLOCKS_PER_SEC) << endl;
    }

    void ThreadProcessing::OnlyProcessingTimeMultithread(const std::string &dir, int batchSize)
    {
        //Load batch of images
        vector<Mat> imageVec= LoadBatchOfImages(dir, batchSize);

        assert((int)imageVec.size() == batchSize);
        cout << "imageVec.size() : " << imageVec.size() << endl;

        time_t OnlyProcessingTimeMultithread = clock();

        std::thread threads[batchSize];
        for(int i=0; i<batchSize; ++i)
        {
            threads[i] = std::thread(ImageProcessingFunction, imageVec[i]);
        }

        for(int i=0; i<batchSize; ++i)
        {
            threads[i].join();
        }

        OnlyProcessingTimeMultithread = clock() - OnlyProcessingTimeMultithread;
        cout << "OnlyProcessingTimeMultithread : " << (float(OnlyProcessingTimeMultithread) / CLOCKS_PER_SEC) << endl;
    }

I figure out that clock() gives wrong results when multithreaded code is used, so I use time ./MyBinary

Here is results:

imageVec.size() : 8
OnlyProcessingTimeSequentialTime : 2.34174
Done.

real    0m2.551s
user    0m2.640s
sys 0m0.316s


imageVec.size() : 8
OnlyProcessingTimeMultithread : 4.36681
Done.

real    0m0.861s
user    0m4.564s
sys 0m0.404s

As we can see real time is smaller.

So previous results should be:

SingleThreadProcessingTime : 13.6235

real    0m13.845s
user    0m13.932s
sys 0m0.280s

MultithreadingProcessingTime : 21.0902

real    0m3.584s
user    0m20.356s
sys 0m1.316s

SingleThreadForEachImageTime : 23.961

real    0m3.370s
user    0m22.584s
sys 0m1.976s

EachThreadProcessChunkOfImagesTime : 20.7885

real    0m3.433s
user    0m20.292s
sys 0m1.116s

So how multithreaded code execution time should be measured?

Update: found answer here

I need to use wallclock time and not cpu time

Here is correct results:

SingleThreadProcessing :  WALLCLOCK TIME: 13.8245 seconds
MultithreadingProcessing :  WALLCLOCK TIME: 4.1977 seconds
SingleThreadForEachImage :  WALLCLOCK TIME: 3.25084 seconds
EachThreadProcessChunkOfImages :  WALLCLOCK TIME: 3.36626 seconds
OnlyProcessingTimeSequential :  WALLCLOCK TIME: 2.36041 seconds
OnlyProcessingTimeMultithread :  WALLCLOCK TIME: 0.706921 seconds
Community
  • 1
  • 1
mrgloom
  • 20,061
  • 36
  • 171
  • 301
  • 2
    For one, starting one thread per image, and then waiting for all of them to finish, before you do that again for another group of images is really inefficient. Why don't you just split the list of files into 8 pieces, and give each thread the whole section of the list? Also, please make your example minimal -- there's no need for all those commented out lines. – Dan Mašek May 23 '16 at 19:16
  • @DanMašek I tried your suggestion, no speed up, see update. – mrgloom May 24 '16 at 09:25
  • 1
    I'll look into this a little more later today. But at this point: "1 thread per image" -- yes, that one is a poor choice. It's not about OS limit, it's about the hardware -- you only have so many CPU cores, so if there are more tasks that that, they have to be switched around to appear to run all at once and the switch takes time. Each thread also needs memory, and they will start trashing each others cache leading to further slowdowns, etc. Your first idea, of only running number of threads adequate for the current CPU counts was good. Did you look at CPU usage graph? What does it look like? – Dan Mašek May 24 '16 at 10:42
  • Right, good catch about [`clock()`](http://en.cppreference.com/w/cpp/chrono/c/clock). _"... On the other hand, if the current process is multithreaded and more than one execution core is available, `std::clock` time may advance faster than wall clock."_ – Dan Mašek May 24 '16 at 12:37
  • Just out of curiosity, you're doing `SingleThreadForEachImageTime` only with 8 images to test this? Just for an educational purpose try it with many more, like 64, and see how the relative performance changes -- you seem to have quad core CPU with hyperthreading, so 8 threads is still a reasonable number. Even for the rest of the test, a much higher number of test images would be appropriate to get a decent measurement. – Dan Mašek May 24 '16 at 12:42

1 Answers1

1

As was clearly pointed out in this question, multithreading is really not efficient when it comes to I/O operations on a single disk. And most of what your threads do is exactly I/O operations.

You're probably being limited by your disk speed and suffering from the overhead due to the creation of threads and the join() operations which contribute to slowing down the multithreaded version of your function.

EDIT:

As remarked by @Dan Mašek in his comment, most of the time is actually spent on the compressing. One way of improving your process would be to create one thread that reads the images from the disk and feeds them to other working threads (maybe through a queue, see Thread pool). This way, you read sequentially but the heavy duty is done by a lot of workers.

Community
  • 1
  • 1
Sunreef
  • 4,452
  • 21
  • 33
  • Maybe there is a way to separate IO operation and image processing operation in some smart way? i.e. some thread load images and some process them? – mrgloom May 24 '16 at 09:27
  • Take a look [here](http://stackoverflow.com/questions/20934406/multi-threaded-reading-from-a-file-in-c). They detail some methods for multithreaded reading that may help you. – Sunreef May 24 '16 at 09:32
  • It depends on the size of your images but it is possible that if they are really big, only reading from disk may be slow. And you can't do much better than sequential reading on a single disk. – Sunreef May 24 '16 at 09:38
  • @Sunreef I would expect that the time spent on the actual disk I/O is dominated by the time taken to actually compress the image. – Dan Mašek May 24 '16 at 10:29
  • 1
    It's true that applying 10x a medianBlur with size 71 is computationally expensive. In this case, having one thread load all images on the RAM and several others working on the blur could speed up the whole process – Sunreef May 24 '16 at 11:16