Why does parallelized OpenCV .at() with OpenMP take more time?

Question

I'm trying to implement image morphology operators in C++ using OpenMP and OpenCV. The algorithm works fine, but in the profiling results from VTune I observe that the parallelized method takes more time than the sequential one, and the time is spent in OpenCV's .at() function. Why? How can I solve it?

Here's my code:

Mat Morph_op_manager::compute_morph_base_op(Mat image, bool parallel, int type) {

//strel attribute
int strel_rows = 5;
int strel_cols = 5;

//strel center coordinate
int cr = 2;
int cc = 2;

//number of rows and columns after the strel center
int nrac = strel_rows - cr;
int ncac = strel_cols - cc;

//strel init


Mat strel(strel_rows,strel_cols,CV_8UC1, Scalar(0));

Mat op_result = image.clone();

if (parallel == false)
    omp_set_num_threads(1); // Use 1 thread for all subsequent parallel regions

//parallelized nested loop
#pragma omp parallel for collapse(4)

for (int i= cr ; i<image.rows-nrac; i++)
    for (int j = cc; j < image.cols -ncac; j++) {

        for (int m = 0; m < strel_rows; m++)
                for (int n = 0; n < strel_cols; n++) {


                // if type == 0 -> erode
                if (type == 0){

                    if (image.at<uchar>(i-(strel_rows-m),j-(strel_cols-n)) != strel.at<uchar>(m,n)){
                        op_result.at<uchar>(i, j) = 255;
                    }
                }

                // if type == 1 -> dilate
                if (type == 1){

                    if (image.at<uchar>(i-(strel_rows-m),j-(strel_cols-n)) == strel.at<uchar>(m,n)){
                        op_result.at<uchar>(i, j) = 0;
                    }
                }
            }
    }
}

Here are the profiling results (linked images): sequential, parallel.

SPEED-UP:

Instead of using the **.at()** method, I now access the pixel matrix through pointers, and I changed my directive as shown in the code below.

One problem persists: in my profiling log, Mat::release() takes a lot of time. Why? How can I solve it?

Speed-up code:

omp_set_num_threads(4);
double start_time = omp_get_wtime();

#pragma omp parallel for shared(strel,image,op_result,strel_el_count) private(i,j) schedule(dynamic) if(parallel == true)
for( i = cr; i < image.rows-nrac; i++)
{
    op_result.addref();
    uchar* opresult_ptr = op_result.ptr<uchar>(i);

    for ( j = cc; j < image.cols-ncac; j++)
    {
        //type == 0 --> erode
        if (type == 0 ){
            if(is_fullfit(image,i,j,strel,strel_el_count,parallel)){
                opresult_ptr[j] = 0;
            }
            else
                opresult_ptr[j] = 255;
        }
    }
}

Here's the is_fullfit function:

bool Morph_op_manager::is_fullfit(Mat image,int i,int j,Mat strel,int strel_counter,bool parallel){

    int mask_counter = 0;
    int ii=0;
    int jj=0;

    for ( ii = 0; ii <strel.rows ; ii++) {

        uchar* strel_ptr = strel.ptr<uchar>(ii);
        uchar* image_ptr = image.ptr<uchar>(i - (strel.rows - ii));

        for ( jj = 0; jj <strel.cols ; ++jj) {
            mask_counter += (int) image_ptr[j-(strel.cols-jj)];
        }
    }
    return mask_counter == strel_counter;
}

CPU data:
CPU Name: 4th generation Intel(R) Core(TM) Processor family
Frequency: 2.4 GHz
Logical CPU Count: 4

Here's my log:

I think the problem is not the .at function itself, but rather the complex indexing arguments, which prevent OpenMP from "fusing" the loops in an efficient way. http://stackoverflow.com/questions/28482833/understanding-the-collapse-clause-in-openmp — Sven Nilsson, May 30 '16 at 09:14
I want to test whether your answer solves my problem. How can I fuse my loops into a single one? I've got four loops :_( — userfi, May 30 '16 at 09:51
This looks like a simple image filter, why do you want to use `collapse` anyway? — mainactual, May 30 '16 at 11:55
Please provide an [mcve], add information about your system (CPU, memory) and use the upload images function of the SO editor. — Zulan, May 30 '16 at 13:02
@mainactual I'm trying this directive, which looks correct to me, but it still doesn't work: `#pragma omp parallel for shared(image,op_result) private(i,j,m,n) if(parallel == true)` — userfi, May 30 '16 at 17:41
@Zulan I'm currently working on an Intel i7. I can't use the SO editor's image upload because I don't have sufficient reputation :) — userfi, May 30 '16 at 18:02
There are many vastly different Intel i7 generations. Please help us so that we don't have to guess - again, see [mcve] (for serial and parallel!). I am pretty sure anyone can upload images with the editor (because regularly 1 rep users do that), but you might not be able to embed, but only link the uploaded image. Tinypic doesn't work with adblock & noscript and will also lead to the question being useless once it is deleted. Maybe you can just embed the textual content of the images in the question. That is preferred anyway. — Zulan, May 30 '16 at 20:26
`#pragma omp parallel for shared(image,op_result)` should work fine. Just note that `cv::dilate` is much faster than what you can get using a simple parallel for loop. — mainactual, May 31 '16 at 08:34
Yes, it works fine, but in my profiling log the function `Mat::release()` takes a lot of time. Why? How can I avoid it? I've updated the code. — userfi, May 31 '16 at 08:48
Pass the arguments by reference (&) instead of by value, and the release time decreases ;) — , May 31 '16 at 15:41
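
The suggestions in the comments above (parallelize only the outer row loop, read pixels through row pointers, pass images by reference) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the asker's code: `Gray8`, `window_all_zero`, and `erode_black` are hypothetical names, a plain `std::vector` stands in for `cv::Mat` so the example compiles without OpenCV, and the structuring element is assumed to be a centred square window of zeros (black foreground on a white background).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a single-channel 8-bit image (not cv::Mat).
struct Gray8 {
    int rows;
    int cols;
    std::vector<std::uint8_t> data;  // row-major, rows * cols bytes

    Gray8(int r, int c, std::uint8_t fill)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, fill) {}

    // Row pointers, analogous in spirit to cv::Mat::ptr<uchar>(r).
    const std::uint8_t* row(int r) const {
        return data.data() + static_cast<std::size_t>(r) * cols;
    }
    std::uint8_t* row(int r) {
        return data.data() + static_cast<std::size_t>(r) * cols;
    }
};

// True when every pixel in the (2*half+1) x (2*half+1) window centred on
// (i, j) is zero. The image is taken by const reference, so no copy is
// made (and later destroyed) on each call.
inline bool window_all_zero(const Gray8& img, int i, int j, int half) {
    for (int di = -half; di <= half; ++di) {
        const std::uint8_t* p = img.row(i + di);
        for (int dj = -half; dj <= half; ++dj) {
            if (p[j + dj] != 0) return false;
        }
    }
    return true;
}

// Erosion of a black (0) foreground: an interior pixel stays 0 only if the
// whole window around it is 0; otherwise it becomes 255. Border pixels are
// left as copied from the input.
Gray8 erode_black(const Gray8& img, int half) {
    assert(half >= 0);
    Gray8 out = img;

    // Only the outer row loop is parallelized: each thread writes its own
    // rows of `out`, so no two threads write the same pixel.
    #pragma omp parallel for schedule(static)
    for (int i = half; i < img.rows - half; ++i) {
        std::uint8_t* dst = out.row(i);
        for (int j = half; j < img.cols - half; ++j) {
            dst[j] = window_all_zero(img, i, j, half) ? 0 : 255;
        }
    }
    return out;
}
```

Calling `erode_black(img, 2)` corresponds to a 5x5 window. With GCC or Clang the pragma takes effect when compiling with `-fopenmp`; without it the loop simply runs sequentially. With `cv::Mat`, the analogous change would be to declare the helper's parameters as `const cv::Mat&`: copying a `cv::Mat` header updates a shared reference count, and destroying the copy calls `Mat::release()`, so passing by value once per pixel is a plausible source of the time reported there.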
