
I created an image processing algorithm using OpenCV and currently I'm trying to improve the time efficiency of my own simple function, which is similar to a LUT but interpolates between values (double calibRI::corr(double)). I optimized the pixel loop according to the OpenCV docs.
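
To give an idea of what corr does, here is a rough sketch of a LUT with linear interpolation between neighbouring entries (lut and step are just placeholder names, not the actual calibRI members):

    #include <cmath>
    #include <vector>

    // Rough sketch only: a LUT with linear interpolation between two neighbouring
    // entries. 'lut' (sampled correction values) and 'step' (sample spacing) are
    // placeholders, not actual calibRI members.
    double corrSketch(double x, const std::vector<double>& lut, double step)
    {
        double pos = x / step;                        // fractional index into the table
        int i = static_cast<int>(std::floor(pos));
        if (i < 0)                        return lut.front();
        if (i + 1 >= (int)lut.size())     return lut.back();
        double frac = pos - i;                        // distance from the lower entry
        return lut[i] * (1.0 - frac) + lut[i + 1] * frac;
    }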

The non-parallel version (calib(cv::Mat), a member of the calibRI functor class) takes about 0.15 s. I decided to use cv::parallel_for_ to make it faster. First I implemented it as image tiling, according to this document. The time was reduced to 0.12 s (4 threads):

    virtual void operator()(const cv::Range& range) const
    {
        for(int i = range.start; i < range.end; i++)
        {
            // divide the image into 'thr' horizontal strips and process them simultaneously
            cv::Rect roi(0, (img.rows/thr)*i, img.cols, img.rows/thr);
            cv::Mat in = img(roi);
            cv::Mat out = retVal(roi);
            out = calib(in); // loops over all pixels and does out[u,v] = calibRI::corr(in[u,v])
        }
    }
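
For completeness, this tiled body is dispatched roughly like this (TiledCalib and its constructor are placeholders here, since I only pasted the operator()):

    // Rough usage sketch: split the image into 'thr' strips and run the body above
    // in parallel. 'TiledCalib', its constructor and the image names are placeholders.
    int thr = 4;
    cv::Mat result(img.size(), img.type());
    TiledCalib body(img, result, thr);          // wraps the operator() shown above
    cv::parallel_for_(cv::Range(0, thr), body);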

I thought that running my function in parallel on sub-images/tiles/ROIs was still not optimal, so I implemented it as below:

template <typename T>
class ParallelPixelLoop : public cv::ParallelLoopBody
{
    typedef boost::function<T(T)> pixelProcessingFunctionPtr;
private:
    cv::Mat& image; // source and result image (overwritten in place)
    bool cont; // true if the image data is continuous in memory
    size_t rows;
    size_t cols;
    size_t threads;
    std::vector<cv::Range> ranges;
    pixelProcessingFunctionPtr pixelProcessingFunction; // pixel modification function
public:
    ParallelPixelLoop(cv::Mat& img, pixelProcessingFunctionPtr fun, size_t thr = 4)
        : image(img), cont(image.isContinuous()), rows(img.rows), cols(img.cols), threads(thr), pixelProcessingFunction(fun)
    {
        size_t groupSize = 1;
        if (cont) {
            // treat a continuous image as one long row
            cols *= rows;
            rows = 1;
            groupSize = cols / threads; // integer division; the last range absorbs the remainder
        }
        else {
            groupSize = rows / threads; // integer division; the last range absorbs the remainder
        }

        size_t t = 0;
        for (t = 0; t < threads - 1; ++t) {
            ranges.push_back( cv::Range( t*groupSize, (t+1)*groupSize ) );
        }
        ranges.push_back( cv::Range( t*groupSize, rows <= 1 ? cols : rows ) ); // the last range must extend to the end of the image
    }

    virtual void operator()(const cv::Range& range) const
    {
        for(int r = range.start; r < range.end; r++)
        {
            T* Ip = nullptr;
            cv::Range ran = ranges.at(r);
            if(cont) {
                Ip = image.ptr<T>(0);
                for (int j = ran.start; j < ran.end; ++j)
                {
                    Ip[j] = pixelProcessingFunction(Ip[j]);
                }
            }
            else {
                for(int i = ran.start; i < ran.end; ++i)
                {
                    Ip = image.ptr<T>(i);
                    for (int j = 0; j < cols; ++j)
                    {
                        Ip[j] = pixelProcessingFunction(Ip[j]);
                    }
                }
            }
        }
    }
};

Then I ran it on a 1280x1024 CV_64FC1 image, on an i5 processor under Windows 8, and got times around 0.4 s using the code below:

double t = cv::getTickCount();
ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";

I have no idea why my implementation is so much slower than iterating over all the pixels in sub-images... Is there a bug in my code, or are OpenCV ROIs optimized in some special way? I do not think it is a time measurement issue like the one described here, as I'm using the OpenCV time functions.

Is there any other way to reduce the time of this function?

Thanks in advance!


1 Answer


Generally it's really hard to say why cv::parallel_for_ failed to speed up the whole process. One possibility is that the problem is not related to processing/multithreading but to time measurement. About 2 months ago I tried to optimize this algorithm and I noticed a strange thing: the first time I used it, it took x ms, but if I used it a second, third, ... time (without restarting the application, of course) it took about x/2 (or even x/3) ms. I'm not sure what causes this behaviour - most likely (in my opinion) it's caused by branch prediction: when the code is executed for the first time the branch predictor "learns" which paths are usually taken, so next time it can predict which branch to take (and usually the guess will be correct). You can read more about it here - it's a really good question and it can open your eyes to some quite important things.

So, in your situation I would try a few things:

  • measure it many times - 100 or 1000 iterations should be enough (at 0.12-0.4 s per run, even 1000 iterations finish in a few minutes) - and see whether the last version of your code is still the slowest one. So just replace your code with this:

    double t = cv::getTickCount();
    for (unsigned int i = 0; i < 1000; i++) {
        ParallelPixelLoop<double> loop(V, boost::bind(&calibRI::corr, this, _1), 4);
        cv::parallel_for_(cv::Range(0,4), loop);
    }
    std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";

  • test it on a bigger image. Maybe in your situation you just "don't need" 4 cores, but on a bigger image 4 cores will make a positive difference (see the sketch after this list for one way to set up such a test).

  • use a profiler (for example Very Sleepy) to see which part of your code is critical
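
For the second point, here is a minimal sketch of such a test, assuming a synthetic CV_64FC1 image is good enough (the 4096x4096 size is arbitrary; the loop construction is copied from your question, so it has to run in the same calibRI context):

    // Rough sketch: benchmark on a larger synthetic image.
    // 4096x4096 is an arbitrary size, roughly 12x more pixels than 1280x1024.
    cv::Mat big(4096, 4096, CV_64FC1);
    cv::randu(big, cv::Scalar::all(0.0), cv::Scalar::all(1.0)); // fill with random doubles
    ParallelPixelLoop<double> loop(big, boost::bind(&calibRI::corr, this, _1), 4);
    cv::parallel_for_(cv::Range(0, 4), loop);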
