Thread creation and context switches can be quite costly: strangely, with GCC 9.3, it takes 10-20 ms just to start the parallel section on my machine with this sample code. Note that some OpenMP runtimes (like the one used by Clang) create the threads once and reuse them for all OpenMP sections. Moreover, setting OMP_PROC_BIND
to TRUE
can help OpenMP threads not to move between cores. Note that the timings of GCC and Clang are quite different on this code.
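As a side note (not from the original code), the binding can also be requested per region with the proc_bind clause, and the policy actually in effect can be queried with omp_get_proc_bind(); a minimal sketch:

#include <omp.h>
#include <cstdio>

int main()
{
    // Reports the binding policy that will apply to the next parallel region
    // (reflects OMP_PROC_BIND when it is set in the environment).
    std::printf("proc_bind policy: %d\n", static_cast<int>(omp_get_proc_bind()));

    // Per-region alternative to the environment variable:
    // keep the threads of this team on nearby cores.
    #pragma omp parallel proc_bind(close)
    {
        #pragma omp master
        std::printf("threads in team: %d\n", omp_get_num_threads());
    }
    return 0;
}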
std::clock
does not measure what you probably want it to: it does not account for process inactivity and it sums the ticks of every thread of the process. Please use the C++ std::chrono::steady_clock
or omp_get_wtime
to measure durations correctly.
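For reference, here is a minimal sketch of both approaches (the work() function is just a placeholder standing in for the code being timed, not part of the original program):

#include <chrono>
#include <cmath>
#include <iostream>
#include <omp.h>

// Placeholder workload (assumption: stands in for the loop being timed)
static void work()
{
    volatile double sum = 0.0;
    for (int i = 0; i < 1000000; ++i)
        sum = sum + std::sqrt(static_cast<double>(i));
}

int main()
{
    // Wall-clock time with std::chrono::steady_clock
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms\n";

    // Same measurement using the OpenMP wall clock
    const double start = omp_get_wtime();
    work();
    std::cout << (omp_get_wtime() - start) * 1000.0 << " ms\n";
}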
Please do not use std::vector<std::vector<cv::Vec3b>>
as it uses a very inefficient memory layout. If you want to perform complex matrix operations, you can use Eigen for example, or write your own type based on contiguous flattened arrays (see the sketch below). Splitting each color channel into a separate array may also help the compiler vectorize operations, improving performance.
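A minimal sketch of such a flattened type (the FlatImage name and its interface are my own illustration, not from the original code):

#include <opencv2/core.hpp>
#include <vector>

// Hypothetical flat 2D container: one contiguous buffer, row-major indexing.
struct FlatImage
{
    int rows = 0, cols = 0;
    std::vector<cv::Vec3b> data; // rows * cols elements, contiguous

    FlatImage(int r, int c) : rows(r), cols(c), data(static_cast<size_t>(r) * c) {}

    cv::Vec3b& operator()(int i, int j) { return data[static_cast<size_t>(i) * cols + j]; }
    const cv::Vec3b& operator()(int i, int j) const { return data[static_cast<size_t>(i) * cols + j]; }
};

// Usage: pixels(i, j) instead of pixelsD[i][j], with a single allocation
// and cache-friendly, contiguous storage.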
On Clang, the pixelsD[i][j]
access produces very slow code with OpenMP because the compiler fails to optimize it. Actually, using collapse is not useful here since the number of threads should be much smaller than the number of rows (it could even decrease performance).
Here is a new version where the time is measured more correctly:
std::vector<std::vector<cv::Vec3b> > pixelsD(maskedImage.rows, std::vector<cv::Vec3b>(maskedImage.cols));

#pragma omp parallel
{
    double start;

    // Wait for all threads to be created and ready
    #pragma omp barrier

    #pragma omp master
    start = omp_get_wtime();

    #pragma omp for
    for (int i = 0; i < maskedImage.rows; ++i)
    {
        std::vector<cv::Vec3b>& row = pixelsD[i];

        for (int j = 0; j < maskedImage.cols; ++j)
        {
            row[j] = maskedImage.at<cv::Vec3b>(i, j);
        }
    } // Implicit barrier here

    #pragma omp master
    {
        const double duration = omp_get_wtime() - start;
        cout << duration << endl;
    }
}

// Side effect to force the compiler to not optimize the previous loop to nothing
cout << "result: " << (int)pixelsD[0][0][0] << endl;
On my 6-core machine and with an image of size 3840x2160, I get the following results:
Clang:
- initial sequential clock time: 8.5 ms
- initial parallel clock time: 60 ~ 63 ms
- new sequential time: 8.5 ms
- new parallel time: 2.4 ms
GCC:
- initial sequential clock time: 9.7 ms
- initial parallel clock time: 3 ~ 93 ms
- new sequential time: 8.5 ms
- new parallel time: 2.3 ms
Theoretical optimal time: 1.2 ms
Note that this operation can be made even faster using direct access to the data
of maskedImage
. Note also that memory accesses tend to scale poorly with the number of cores. The results are not bad here because the compilers generate quite inefficient code (although doing better is difficult given the memory layout).
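For example, a sketch of such a direct-access version (building on the maskedImage and pixelsD variables above, and assuming maskedImage is a CV_8UC3 cv::Mat as in the original code) could use cv::Mat::ptr to copy whole rows; it also needs <algorithm> for std::copy:

#pragma omp parallel for
for (int i = 0; i < maskedImage.rows; ++i)
{
    // Direct pointer to the start of the i-th row of the CV_8UC3 matrix
    const cv::Vec3b* srcRow = maskedImage.ptr<cv::Vec3b>(i);
    std::vector<cv::Vec3b>& dstRow = pixelsD[i];

    // Plain contiguous row copy; compilers usually turn this into a memcpy
    std::copy(srcRow, srcRow + maskedImage.cols, dstRow.begin());
}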