cv::parallel_for_ not very big improvement

Question

I'm testing the class cv::ParallelLoopBody for image processing code.

I started by implementing the normalization, where I have to divide every pixel by a per-channel value, which should be easy to parallelize.

However, when testing it, I don't see any difference.

Am I doing something wrong here?

This is my class:

class Parallel_process : public cv::ParallelLoopBody
{
private:
    cv::Mat img; // my image to normalize
    std::vector<int> A;
    int diff;

public:
    Parallel_process(cv::Mat inputImage, std::vector<int> AA, int diffVal)
        : img(inputImage), A(AA), diff(diffVal) {}

    virtual void operator()(const cv::Range& range) const
    {
        for (int i = range.start; i < range.end; i++)
        {
            // in is a patch of my original image
            cv::Mat in(img, cv::Rect(0, (img.rows/diff)*i, img.cols, img.rows/diff));
            std::vector<int> AAA(A);
            in.forEach<cv::Vec3f>(
                [&AAA](cv::Vec3f &pixel, const int* po) -> void
                {
                    pixel[0] /= AAA[0];
                    pixel[1] /= AAA[1];
                    pixel[2] /= AAA[2];
                }
            );
        }
    }
};

and in the main() function I'm calling my operator like so:

cv::parallel_for_(cv::Range(0, 91), Parallel_process(img, AA, 91)); //my image is 1288*728 size so 728/91=8
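
A side note on the banding used above: `img.rows/diff` is an integer division, so the 91 patches cover the whole image only when the row count is a multiple of the band count, as it is here (728/91 = 8). Below is a minimal sketch of a partition that also covers leftover rows. The helper `bandFor` and the `Band` struct are hypothetical names introduced for illustration; they are not part of OpenCV.

```cpp
#include <algorithm>

// Hypothetical helper (not an OpenCV API): first row and height of band i
// when `rows` image rows are split into `bands` contiguous bands.
struct Band
{
    int y;      // first row of the band
    int height; // number of rows in the band
};

Band bandFor(int rows, int bands, int i)
{
    const int base  = rows / bands; // rows every band gets
    const int extra = rows % bands; // leftover rows, one each to the first bands
    const int y = i * base + std::min(i, extra);
    const int height = base + (i < extra ? 1 : 0);
    return Band{y, height};
}
```

With 728 rows and 91 bands this gives the same 8-row patches as the original expression; with a row count such as 730 the first two bands would get 9 rows, so no rows are skipped.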

EDIT

This is my OpenCV configuration:

General configuration for OpenCV 3.3.1 =====================================
  Version control:               unknown

  Extra modules:
    Location (extra):            /home/jrsros/opencv_contrib-3.3.1/modules
    Version control (extra):     unknown

  Platform:
    Timestamp:                   2017-12-14T13:05:47Z
    Host:                        Linux 4.10.0-40-generic x86_64
    CMake:                       3.5.1
    CMake generator:             Unix Makefiles
    CMake build tool:            /usr/bin/make
    Configuration:               Release

  CPU/HW features:
    Baseline:                    SSE SSE2 SSE3
      requested:                 SSE3
    Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2
      requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2
      SSE4_1 (3 files):          + SSSE3 SSE4_1
      SSE4_2 (1 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
      FP16 (1 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
      AVX (5 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
      AVX2 (8 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2

  C/C++:
    Built as dynamic libs?:      YES
    C++11:                       YES
    C++ Compiler:                /usr/bin/c++  (ver 7.2.0)
    C++ flags (Release):         -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wno-narrowing -Wno-delete-non-virtual-dtor -Wno-comment -Wno-implicit-fallthrough -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffast-math -ffunction-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -fopenmp -O3 -DNDEBUG  -DNDEBUG
    C++ flags (Debug):           -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wno-narrowing -Wno-delete-non-virtual-dtor -Wno-comment -Wno-implicit-fallthrough -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffast-math -ffunction-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -fopenmp -g  -O0 -DDEBUG -D_DEBUG
    C Compiler:                  /usr/bin/gcc-5
    C flags (Release):           -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-narrowing -Wno-comment -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffast-math -ffunction-sections  -msse -msse2 -msse3 -fvisibility=hidden -fopenmp -O3 -DNDEBUG  -DNDEBUG
    C flags (Debug):             -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-narrowing -Wno-comment -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffast-math -ffunction-sections  -msse -msse2 -msse3 -fvisibility=hidden -fopenmp -g  -O0 -DDEBUG -D_DEBUG
    Linker flags (Release):
    Linker flags (Debug):
    ccache:                      NO
    Precompiled headers:         NO
    Extra dependencies:          dl m pthread rt /usr/lib/x86_64-linux-gnu/libGLU.so /usr/lib/x86_64-linux-gnu/libGL.so /usr/lib/x86_64-linux-gnu/libtbb.so cudart nppc nppial nppicc nppicom nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cufft -L/usr/local/cuda-8.0/lib64
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 cudev core cudaarithm flann hdf imgproc ml objdetect phase_unwrapping plot reg surface_matching video viz xphoto bgsegm cudabgsegm cudafilters cudaimgproc cudawarping dnn face freetype fuzzy img_hash imgcodecs photo shape videoio xobjdetect cudacodec highgui bioinspired dpm features2d line_descriptor saliency text calib3d ccalib cudafeatures2d cudalegacy cudaobjdetect cudaoptflow cudastereo datasets rgbd stereo structured_light superres tracking videostab xfeatures2d ximgproc aruco optflow stitching python2
    Disabled:                    js world contrib_world
    Disabled by dependency:      -
    Unavailable:                 java python3 ts cnn_3dobj cvv dnn_modern matlab sfm

  GUI: 
    QT:                          NO
    GTK+ 2.x:                    YES (ver 2.24.30)
    GThread :                    YES (ver 2.48.2)
    GtkGlExt:                    YES (ver 1.2.0)
    OpenGL support:              YES (/usr/lib/x86_64-linux-gnu/libGLU.so /usr/lib/x86_64-linux-gnu/libGL.so)
    VTK support:                 YES (ver 6.2.0)

  Media I/O: 
    ZLib:                        /usr/lib/x86_64-linux-gnu/libz.so (ver 1.2.8)
    JPEG:                        /usr/lib/x86_64-linux-gnu/libjpeg.so (ver )
    WEBP:                        /usr/lib/x86_64-linux-gnu/libwebp.so (ver encoder: 0x0202)
    PNG:                         /usr/lib/x86_64-linux-gnu/libpng.so (ver 1.2.54)
    TIFF:                        /usr/lib/x86_64-linux-gnu/libtiff.so (ver 42 - 4.0.6)
    JPEG 2000:                   /usr/lib/x86_64-linux-gnu/libjasper.so (ver 1.900.1)
    OpenEXR:                     build (ver 1.7.1)
    GDAL:                        NO
    GDCM:                        NO

  Video I/O:
    DC1394 1.x:                  NO
    DC1394 2.x:                  NO
    FFMPEG:                      YES
      avcodec:                   YES (ver 56.60.100)
      avformat:                  YES (ver 56.40.101)
      avutil:                    YES (ver 54.31.100)
      swscale:                   YES (ver 3.1.101)
      avresample:                NO
    GStreamer:                   
      base:                      YES (ver 1.8.3)
      video:                     YES (ver 1.8.3)
      app:                       YES (ver 1.8.3)
      riff:                      YES (ver 1.8.3)
      pbutils:                   YES (ver 1.8.3)
    OpenNI:                      NO
    OpenNI PrimeSensor Modules:  NO
    OpenNI2:                     NO
    PvAPI:                       NO
    GigEVisionSDK:               NO
    Aravis SDK:                  NO
    UniCap:                      NO
    UniCap ucil:                 NO
    V4L/V4L2:                    NO/YES
    XIMEA:                       NO
    Xine:                        NO
    Intel Media SDK:             NO
    gPhoto2:                     NO

  Parallel framework:            TBB (ver 4.4 interface 9002)

  Trace:                         YES (with Intel ITT)

  Other third-party libraries:
    Use Intel IPP:               2017.0.3 [2017.0.3]
               at:               /home/jrsros/opencv-3.3.1/build/3rdparty/ippicv/ippicv_lnx
    Use Intel IPP IW:            sources (2017.0.3)
                  at:            /home/jrsros/opencv-3.3.1/build/3rdparty/ippicv/ippiw_lnx
    Use VA:                      NO
    Use Intel VA-API/OpenCL:     NO
    Use Lapack:                  NO
    Use Eigen:                   YES (ver 3.2.92)
    Use Cuda:                    YES (ver 8.0)
    Use OpenCL:                  YES
    Use OpenVX:                  NO
    Use custom HAL:              NO

  NVIDIA CUDA
    Use CUFFT:                   YES
    Use CUBLAS:                  YES
    USE NVCUVID:                 NO
    NVIDIA GPU arch:             20 30 35 37 50 52 60 61
    NVIDIA PTX archs:
    Use fast math:               YES

  OpenCL:                        <Dynamic loading of OpenCL library>
    Include path:                /home/jrsros/opencv-3.3.1/3rdparty/include/opencl/1.2
    Use AMDFFT:                  NO
    Use AMDBLAS:                 NO

  Python 2:
    Interpreter:                 /usr/bin/python2.7 (ver 2.7.12)
    Libraries:                   /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
    numpy:                       /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.11.0)
    packages path:               lib/python2.7/dist-packages

  Python 3:
    Interpreter:                 /usr/bin/python3 (ver 3.5.2)

  Python (for build):            /usr/bin/python2.7

  Java:
    ant:                         NO
    JNI:                         NO
    Java wrappers:               NO
    Java tests:                  NO

  Matlab:
    mex:                         /usr/local/MATLAB/R2017b/bin/mex
    Compiler/generator:          Not working (bindings will not be generated)

  Documentation:
    Doxygen:                     NO

  Tests and samples:
    Tests:                       NO
    Performance tests:           NO
    C/C++ Examples:              NO

  Install path:                  /usr/local

  cvconfig.h is in:              /home/jrsros/opencv-3.3.1/build
-----------------------------------------------------------------

An example of code I'm using to test my parallelism:

#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/core/utility.hpp>

#include <iostream>
#include <vector>

class Parallel_process : public cv::ParallelLoopBody
{
private:
    cv::Mat img;
    std::vector<int> A;
    int diff;

public:
    Parallel_process(cv::Mat inputImage, std::vector<int> AA, int diffVal)
        : img(inputImage), A(AA), diff(diffVal) {}

    virtual void operator()(const cv::Range& range) const
    {
        for (int i = range.start; i < range.end; i++)
        {
            // in is a patch (a band of rows) of the original image
            cv::Mat in(img, cv::Rect(0, (img.rows/diff)*i, img.cols, img.rows/diff));
            std::vector<int> AAA(A);
            in.forEach<cv::Vec3f>(
                [&AAA](cv::Vec3f &pixel, const int* /*position*/) -> void
                {
                    pixel[0] /= AAA[0];
                    pixel[1] /= AAA[1];
                    pixel[2] /= AAA[2];
                }
            );
        }
    }
};

int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        std::cerr << "usage: " << argv[0] << " <image>" << std::endl;
        return 1;
    }

    cv::Mat src = cv::imread(argv[1]);
    if (src.empty())
        return 1;

    // forEach<cv::Vec3f> needs float pixels; imread returns 8-bit data by default
    src.convertTo(src, CV_32FC3);

    std::vector<int> AA; // one divisor per channel, matching the constructor

    //compute AA here

    const int64 timeStart = cv::getTickCount();

    //Normalize the image /AA
    cv::parallel_for_(cv::Range(0, 91), Parallel_process(src, AA, 91));

    const double elapsed = (cv::getTickCount() - timeStart) / cv::getTickFrequency();
    std::cout << "ALL " << 1/elapsed << std::endl << std::endl; //FPS

    return 0;
}

I'm compiling my code on Ubuntu Linux, on a Core i7-2630QM at 2 GHz (4 cores, 8 threads):

g++ -std=c++1z -Wall -Weffc++ -Ofast -march=native test4.cpp -o test4 `pkg-config --cflags --libs opencv`

EDIT2: In htop I see that it ends up using all the threads. These are the timings from the test program linked in the comments:

1 2 3 4 5 6 7 8 
Variant 1
threads,mean(us),min(us),max(us)
1,18798.2,18106,22780
2,9762.84,9427,10397
3,6813.55,6765,9296
4,5317.75,5088,7433
5,5067.11,4931,7552
6,4925.41,4780,9473
7,4797.74,4641,9492
8,4798.18,4504,27244

1 2 3 4 5 6 7 8 
Variant 2
threads,mean(us),min(us),max(us)
1,18512.8,17780,20084
2,9788.73,9302,11338
3,6850.47,6671,9765
4,5209.64,5022,8831
5,5052.46,4881,7041
6,6851.99,4762,11422
7,5077.32,4624,9886
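
To read the tables as scaling figures, each mean can be divided into the 1-thread mean. The sketch below does that for the Variant 1 means copied from the table above; `Timing`, `Scaling`, and `scalingFrom` are hypothetical names introduced here, and the code assumes the first row is the single-threaded baseline.

```cpp
#include <vector>

struct Timing
{
    int threads;
    double meanUs; // mean run time in microseconds
};

struct Scaling
{
    int threads;
    double speedup;    // baseline mean / mean at this thread count
    double efficiency; // speedup / threads
};

// Assumes timings.front() is the 1-thread baseline.
std::vector<Scaling> scalingFrom(const std::vector<Timing>& timings)
{
    std::vector<Scaling> out;
    if (timings.empty())
        return out;
    const double base = timings.front().meanUs;
    for (const Timing& t : timings)
    {
        const double speedup = base / t.meanUs;
        out.push_back(Scaling{t.threads, speedup, speedup / t.threads});
    }
    return out;
}

// Variant 1 means, copied from the table above.
const std::vector<Timing> variant1 = {
    {1, 18798.2}, {2, 9762.84}, {3, 6813.55}, {4, 5317.75},
    {5, 5067.11}, {6, 4925.41}, {7, 4797.74}, {8, 4798.18},
};
```

For Variant 1 this works out to roughly 3.5x at 4 threads and roughly 3.9x at 8 threads, which looks consistent with a CPU that has 4 physical cores and 8 hardware threads.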

Can you show the output of [`cv::getBuildInformation()`](https://docs.opencv.org/3.1.0/db/de0/group__core__utils.html#ga0ae377100bc03ce22322926bba7fdbb5)? And a [mcve] (the whole program you're measuring, not just bits and pieces). – Dan Mašek, Dec 14 '17 at 23:37
As Dan has already noted, would you also kindly state **what speedup you expect from the parallelised code-execution**, expressed as a factor relative to the pure-`[SERIAL]` code-execution? Next, would you mind also specifying the hardware that you are going to deploy such code on? (Number of CPU cores, CPU class (architecture), and CPU cache sizes [MB] for the latter question.) – user3666197, Dec 15 '17 at 00:51
@Nayfe When I look at htop, it doesn't seem to use more than 1 or 2 threads. It shows only one thread at 100% in both cases, with parallel_for_ and with forEach. – Ja_cpp, Dec 15 '17 at 09:03
Did you look at this [page](https://docs.opencv.org/trunk/d7/dff/tutorial_how_to_use_OpenCV_parallel_for_.html)? You may have a memory bottleneck. What works well is using OpenCV methods, as they have lots of optimizations (SSE etc.). For instance, can't you just split the 3 channels into three different images, then do the normalization and merge the result? – Nayfe, Dec 15 '17 at 09:25
@Ja_cpp So, do I understand correctly that your measurement consists of running the algorithm *once* on a *single* image? In other words, you're trying to observe a roughly 1 millisecond interval of CPU usage? IMHO that's where the problem lies. – Dan Mašek, Dec 15 '17 at 16:28
@DanMašek I have a long program, and that was the part where I'm testing the parallelism. My whole code runs on 13000 images, so while it is processing I can check in htop which threads, and how many, my code is using. – Ja_cpp, Dec 15 '17 at 20:52
Ah, ok -- do you `imread` all 13000 first, and then work on them, or do you have a loop that does a sequence of `imread`, process, output? | How does [this little program](https://pastebin.com/MJHpqsXv) perform on your machine? I don't have a Linux machine on hand to try it, but it seems to scale as I would expect on my Windows machine. – Dan Mašek, Dec 15 '17 at 22:28
@DanMašek I have a loop that reads one image at a time, processes it and then moves on to the next. (I edited my question.) – Ja_cpp, Dec 15 '17 at 22:47
@Ja_cpp OK. In that case, what's likely to be happening (based on my local observations) is that the `imread` (which is single threaded) dominates -- it actually takes ~1.25 times the time of running the algorithm single threaded, or ~8.0 times the time of running it on all logical CPUs. | The timings of my app you added in the last edit suggest the algorithm scales across the CPUs as expected. So ya, I think the earlier statement holds. – Dan Mašek, Dec 15 '17 at 23:05
@Ja_cpp For this many images, I'd pipeline it more. Have a threadpool that just `imread`s images and populates a synchronized queue (with some limit on how many images are in the queue). Another threadpool that consumes the input image queue, processes the images, and puts them in an output queue. One more threadpool that consumes the output queue and does whatever is appropriate. Experiment with it to determine the best number of threads for each threadpool to make it saturate. | It seems the way you used `forEach` and `parallel_for_` was OK, you just weren't observing the right thing. – Dan Mašek, Dec 15 '17 at 23:15
@DanMašek Thank you for all the information. I agree with you, what you said seems logical. I also tested it like this: for(;;)parallel_for_ and it is using all threads, which confirms your observations. – Ja_cpp, Dec 16 '17 at 10:36
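
The pipelining idea from the comments (a reader filling a size-limited synchronized queue, and a pool of workers draining it) could look roughly like the sketch below. It uses only the C++17 standard library; plain integers stand in for images, squaring stands in for the per-image processing, and `BoundedQueue` and `runPipeline` are hypothetical names, not OpenCV or standard APIs.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Synchronized queue with a size limit, so the reader cannot run far ahead.
template <typename T>
class BoundedQueue
{
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(T item) // blocks while the queue is full
    {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(std::move(item));
        notEmpty_.notify_one();
    }

    std::optional<T> pop() // empty optional means: closed and drained
    {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty())
            return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        notFull_.notify_one();
        return item;
    }

    void close() // no more items will be pushed
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        notEmpty_.notify_all();
    }

private:
    std::size_t capacity_;
    std::queue<T> items_;
    std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
    bool closed_ = false;
};

// One reader thread "loads" itemCount items; workerCount (>= 1) threads process them.
long long runPipeline(int itemCount, int workerCount)
{
    BoundedQueue<int> input(8); // at most 8 pending items
    std::mutex sumMutex;
    long long sum = 0;

    std::thread reader([&] {
        for (int i = 0; i < itemCount; ++i)
            input.push(i); // stand-in for imread
        input.close();
    });

    std::vector<std::thread> workers;
    for (int w = 0; w < workerCount; ++w)
    {
        workers.emplace_back([&] {
            long long local = 0;
            while (auto item = input.pop())
                local += static_cast<long long>(*item) * *item; // stand-in for processing
            std::lock_guard<std::mutex> lock(sumMutex);
            sum += local;
        });
    }

    reader.join();
    for (std::thread& t : workers)
        t.join();
    return sum;
}
```

In a real program the reader would push decoded images and a second queue would carry results to an output stage; the queue capacity and the number of threads per stage are the knobs to experiment with.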
