
Trying to optimize OpenCV code with OpenMP; the actual execution time with OpenMP is longer. 2 cores, 4 threads. Image size: [3024 x 4032]

std::vector<std::vector<cv::Vec3b> > pixelsD(maskedImage.rows, std::vector<cv::Vec3b>(maskedImage.cols));
std::clock_t start;
double duration;
start = std::clock();
////none, without openMP               0.129677 sec 
//#pragma omp parallel for          // 0.213286 sec
#pragma omp parallel for collapse(2)// 0.206435 sec
    for (int i = 0; i < maskedImage.rows; ++i)
        for (int j = 0; j < maskedImage.cols; ++j){
            pixelsD[i][j] = maskedImage.at<cv::Vec3b>(i, j);
            // printf("%d %d %d\n", i, j, omp_get_thread_num());
        }
duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;

My guess: the reason is context switching, which takes extra time. What other reasons might there be? How could I optimize this to make use of the available resources? Any other approaches?

Input appreciated.

P.S.: The reason for translating between cv::Mat and std::vector is to use erase, push_back and insert for manipulating the image's content.

cived

3 Answers


Thread creation can be quite costly, as can context switches: surprisingly, with GCC 9.3 it takes 10-20 ms just to start the parallel section on my machine with this sample code. Note that some OpenMP runtimes like Clang's can create the threads once and reuse them for all OpenMP sections. Moreover, setting OMP_PROC_BIND to TRUE can keep OpenMP threads from moving between cores. Note also that timings between GCC and Clang are quite different on this code.

std::clock does not measure what you probably want it to: it ignores process inactivity and sums the ticks of every thread of the process. Please use C++ std::chrono::steady_clock or omp_get_wtime to measure durations correctly.

Please do not use std::vector<std::vector<cv::Vec3b>>, as it uses a very inefficient memory layout. If you want to perform complex matrix operations, you can use Eigen for example, or write your own type based on a contiguous flattened array. Splitting each color channel into a separate array may also help the compiler vectorize operations, improving performance.

On Clang, the pixelsD[i][j] access produces very slow code with OpenMP, as the compiler fails to optimize it. Moreover, using collapse is not useful here, since the number of threads should be much smaller than the number of rows (it could even decrease performance).

Here is a new version where the time is more correctly measured:

std::vector<std::vector<cv::Vec3b> > pixelsD(maskedImage.rows, std::vector<cv::Vec3b>(maskedImage.cols));

#pragma omp parallel
{
    double start;

    // Wait for all threads to be created and ready
    #pragma omp barrier

    #pragma omp master
    start = omp_get_wtime();

    #pragma omp for
    for (int i = 0; i < maskedImage.rows; ++i)
    {
        std::vector<cv::Vec3b>& row = pixelsD[i];

        for (int j = 0; j < maskedImage.cols; ++j)
        {
            row[j] = maskedImage.at<cv::Vec3b>(i, j);
        }
    } // Implicit barrier here

    #pragma omp master
    {
        const double duration = omp_get_wtime() - start;
        std::cout << duration << std::endl;
    }
}

// Side effect to force the compiler to not optimize the previous loop to nothing
std::cout << "result: " << (int)pixelsD[0][0][0] << std::endl;

On my 6-core machine and with an image of size 3840x2160, I get the following results:

Clang:
- initial sequential clock time: 8.5 ms
- initial parallel clock time: 60 ~ 63 ms
- new sequential time: 8.5 ms
- new parallel time: 2.4 ms

GCC:
- initial sequential clock time: 9.7 ms
- initial parallel clock time: 3 ~ 93 ms
- new sequential time: 8.5 ms
- new parallel time: 2.3 ms

Theoretical optimal time: 1.2 ms

Note that this operation can be made even faster using direct access to the data of maskedImage. Note also that memory accesses tend to scale poorly with the number of threads. The speedup here is not bad only because the compilers generate fairly inefficient code for this loop (though that is hard to avoid given the memory layout).

Jérôme Richard
    Thank you for clarifying the techniques for profiling. Your answer made me realize that my dev machine is slow, which brings hope that at deployment even the sequential execution may suffice. Many thanks! – cived Aug 27 '20 at 07:47

Another possible explanation is this link.

It suggests avoiding the use of the i and j indices inside the loop body.

If I remember correctly, the data of an OpenCV Mat occupies a contiguous region of memory, at least for each row, and for the entire image in some cases. Since this is also the case for vectors, you could copy the image line by line (or the entire image at once) instead of pixel by pixel.

tetorea

I think the threads are switching too frequently (once per row), which costs extra processor time for management. It should work more effectively if you assign larger pieces of work to each thread, an image per thread for instance.

Andrey Smorodov