
I have a 2D image in which I want to count the occurrences of each color and store the result in an array. I know the number of colors in advance, so I can size the array beforehand. My problem is that the counting takes too long. How can I speed it up with OpenMP?
My current serial code is:

std::vector<int> ref_color_num_thread;
ref_color_num.resize(ref_color.size());
std::fill(ref_color_num.begin(), ref_color_num.end(), 0);
ref_color_num_thread.resize(ref_color.size());
std::fill(ref_color_num_thread.begin(), ref_color_num_thread.end(), 0);

for (int i = 0; i < image.width(); i++)
{
    for (int j = 0; j < image.height(); j++)
    {
        for (int k = 0; k < (int)ref_color.size(); k++)
        {
            if (image(i, j, 0, 0) == ref_color[k].R && image(i, j, 0, 1) == ref_color[k].G && image(i, j, 0, 2) == ref_color[k].B)
                ref_color_num_thread[k]++;
        }
    }
}

My first approach was to put #pragma omp parallel for on each of the loops (one at a time), but every time the program crashes due to an invalid memory access. Do I have to use private() for my vector?

arc_lupus

2 Answers


What you're doing is filling a histogram of your colors. This is equivalent to doing an array reduction in C/C++ with OpenMP. In C/C++, OpenMP does not have built-in support for this (Fortran does, because there the array size is known to the compiler, whereas in C/C++ it is only known for static arrays). However, it's easy to do an array reduction in C/C++ with OpenMP yourself.

#pragma omp parallel 
{
    std::vector<int> ref_color_num_thread_private(ref_color.size(),0);
    #pragma omp for
    for (int i = 0; i < image.width(); i++) {
        for (int j = 0; j < image.height(); j++) {
            for (int k = 0; k < (int)ref_color.size(); k++) {
                if (image(i, j, 0, 0) == ref_color[k].R && image(i, j, 0, 1) == ref_color[k].G && image(i, j, 0, 2) == ref_color[k].B) 
                    ref_color_num_thread_private[k]++;
            }
        }
    }
    #pragma omp critical 
    {   
        for(int i=0; i<(int)ref_color.size(); i++) {
             ref_color_num_thread[i] += ref_color_num_thread_private[i];
        }
    }
}

I went into a lot more detail about this here: Fill histograms (array reduction) in parallel with OpenMP without using a critical section

There I showed how to do an array reduction without a critical section, but it's a lot more tricky. You should test the first version and see if it works well for you. As long as the number of colors (ref_color.size()) is small compared to the number of pixels, it should parallelize well. Otherwise, you might need to try the second version without a critical section.

Z boson

There is a race condition if either of the two outer loops (i or j) is parallelized, because the inner loop (over k) then updates the shared vector ref_color_num_thread from multiple threads at once. I think your crash is caused by that.

You have to restructure your program. It is not trivial, but one idea is that each thread uses a local copy of the ref_color_num_thread vector. Once the computation is finished, you can sum up all the vectors.

If k is large enough to provide enough parallelism, you could exchange the loops. Instead of "i,j,k" you could iterate in the order "k,i,j". If I'm not mistaken, there are no violated dependencies. Then you can parallelize the outer k loop, and let the inner i and j loops execute sequentially.

Update:

#pragma omp for also supports reductions, for example:

#pragma omp for reduction(+ : nSum)

Here is a link to some documentation.

Maybe that can help you to restructure your program.

Philipp Claßen
  • Can you give me an example of spreading the integers? I'm not so good at programming this stuff. Thank you! – arc_lupus Jun 02 '14 at 21:58
  • @arc_lupus I updated my post. I had overlooked that you cannot easily parallelize the i and j loops. – Philipp Claßen Jun 02 '14 at 22:03
  • Yes, if I set the #pragma before the outermost loop, my code is faster, but the results change from time to time. – arc_lupus Jun 02 '14 at 22:04
  • @arc_lupus that is because you have a race condition. See this answer for explanation about what that is: http://stackoverflow.com/questions/34510/what-is-a-race-condition. The simplest solution is to serialize the updates to `ref_color_num_thread` but that totally defeats the purpose of multithreading. You'll need to do as @phillip suggested and restructure your algorithm and/or data structure. – maxywb Jun 02 '14 at 22:09
  • @maxywb: Separate copies of the vector for each thread aren't a good idea for large vectors, in my opinion. Nevertheless, how do I create these separate copies? With `private(vector)`? – arc_lupus Jun 02 '14 at 22:15
  • @arc_lupus There are other ways of writing the algorithm so each thread won't be stepping on the toes of other threads. But you'll have to figure out what works for your application. – maxywb Jun 03 '14 at 14:42
  • @maxywb: I already solved my problem with the solution provided above, but if I have time, I will continue with my first approach and your suggestion. – arc_lupus Jun 03 '14 at 20:12
  • @arc_lupus the solution above is one way of keeping the threads from stepping on each other's toes and generally doing some kind of reduction is the best solution. – maxywb Jun 03 '14 at 20:26