
I'm trying to implement two versions of a function that finds the max element in an array of floats. However, my parallel functions turned out to run much slower than the serial code.

With an array of 4194304 (2048 * 2048) floats, I get the following timings (in microseconds):

  • serial code: 9433

  • PPL code: 24184 (more than two times slower)

  • OpenMP code: 862093 (almost 100 times slower)

Here's the code:

PPL:

float find_largest_element_in_matrix_PPL(float* m, size_t dims)
{
    concurrency::combinable<float> locals([] { return -FLT_MAX; }); // FLT_MAX from <cfloat>
    concurrency::parallel_for(size_t(0), dims * dims, [m, &locals](size_t curr)
    {
        float& localMax = locals.local();
        localMax = max<float>(localMax, m[curr]); // compare the element, not the index
    });

    return locals.combine([](float left, float right) { return max<float>(left, right); });
}

OpenMP:

float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
    float max_value = 0.0;
    int i, row, col, index;

    #pragma omp parallel for private(i) shared(max_value, index)
    for (i = 0; i < dims * dims; ++i)
    {
#pragma omp critical
        if (m[i] > max_value)
        {
            max_value = m[i];
            index = i;
        }
    }

    //row = index / dims;
    //col = index % dims;
    return max_value;
}

  1. What's making the code run so slowly? Am I missing something?

  2. Could you help me find out what I'm doing wrong?

Denis Yakovenko
    I don't know PPL, but your OMP code is not parallel at all. Get rid of the critical section in the loop and reduce the result when the loop is done. – Baum mit Augen Dec 19 '15 at 15:06
  • @BaummitAugen getting rid of critical section indeed made the code run 5 times faster than the serial variant! But what do you mean by 'reduce the result when the loop is done'? – Denis Yakovenko Dec 19 '15 at 15:10
    If you are using libstdc++, you can also try `std::max_element` with the [parallel mode](https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html). That will handle the OMP stuff for you. – Baum mit Augen Dec 19 '15 at 15:10
  • *"But what do you mean by 'reduce the result when the loop is done'?"* Let every core find its own maximum and combine them when all cores are done. [First hit on google](http://www.techdarting.com/2013/06/openmp-min-max-reduction-code.html), did not test myself. In general, you need to try pretty hard to avoid shared data if you want fast multithreaded code. – Baum mit Augen Dec 19 '15 at 15:12
  • unfortunately, the version of OpenMP I use doesn't allow `max` in `reduction` directive – Denis Yakovenko Dec 19 '15 at 15:14
  • You can easily do this by hand, analog to [this](http://stackoverflow.com/a/20421200/3002139) answer. – Baum mit Augen Dec 19 '15 at 15:16
  • See also https://stackoverflow.com/questions/978222/openmp-c-algorithms-for-min-max-median-average – Baum mit Augen Dec 19 '15 at 15:21
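The by-hand reduction the comments describe can be sketched like this: each thread scans with its own private maximum, and the critical section is entered once per thread rather than once per element (function and variable names here are illustrative, not from the question):

```cpp
#include <cfloat>
#include <cstddef>

// Manual max-reduction: per-thread local maxima, combined once at the end.
float find_max_manual(const float* m, std::size_t n)
{
    float global_max = -FLT_MAX;
    #pragma omp parallel
    {
        float local_max = -FLT_MAX;              // each thread's private maximum
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(n); ++i)
            if (m[i] > local_max)
                local_max = m[i];
        #pragma omp critical                     // entered once per thread, not per element
        if (local_max > global_max)
            global_max = local_max;
    }
    return global_max;
}
```

With OpenMP disabled the pragmas are ignored and the function still returns the correct maximum serially, so it degrades gracefully.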

1 Answer


So, as Baum mit Augen noticed, the problem with the OpenMP version was the critical section: every iteration serialized on it, so the code didn't actually run in parallel at all. Removing the critical section did the trick.

As for PPL, I've found out that it does a lot more setup work (creating threads and so on) than OpenMP does, hence the slowdown.
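Beyond the startup cost, the PPL version also invokes the lambda (and the `locals.local()` lookup inside it) once per element, four million times. A common way to amortize that is to hand each worker a contiguous range instead of a single index. A portable sketch of the idea using `std::thread` rather than PPL (the helper name and the default thread count are made up for illustration):

```cpp
#include <algorithm>
#include <cfloat>
#include <cstddef>
#include <thread>
#include <vector>

// Chunked parallel max: each worker scans one contiguous slice serially,
// so the per-element cost is just the comparison itself.
float find_max_chunked(const float* m, std::size_t n, unsigned num_threads = 4)
{
    std::vector<float> partial(num_threads, -FLT_MAX);
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([=, &partial] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(n, begin + chunk);
            float local = -FLT_MAX;
            for (std::size_t i = begin; i < end; ++i)
                local = std::max(local, m[i]);
            partial[t] = local;                  // no shared writes until the join
        });
    }
    for (auto& w : workers) w.join();
    return *std::max_element(partial.begin(), partial.end());
}
```

The same chunking can be expressed in PPL by running `parallel_for` over chunk indices instead of element indices; the structure is identical.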


Update

So, here's the correct variant to find the max element with OpenMP (a critical section is still needed, but the comparison must be repeated inside it; otherwise two threads can race between the outer test and the assignment):

float find_largest_element_in_matrix_OMP(float* m, const unsigned int dims)
{
    float max_value = -FLT_MAX; // 0.0 would be wrong for all-negative input

    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(dims) * dims; ++i)
    {
        if (m[i] > max_value)           // unsynchronized fast-path check
        {
            #pragma omp critical
            {
                if (m[i] > max_value)   // re-check under the lock
                    max_value = m[i];
            }
        }
    }
    return max_value;
}

PS: not tested.
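For completeness: compilers implementing OpenMP 3.1 or later accept `max` directly in a `reduction` clause (the compiler used in the question did not), which removes the critical section entirely. A sketch, assuming such a compiler:

```cpp
#include <cfloat>

// Requires OpenMP 3.1+ for reduction(max:...); no critical section needed.
float find_largest_element_reduction(const float* m, unsigned int dims)
{
    float max_value = -FLT_MAX;
    #pragma omp parallel for reduction(max : max_value)
    for (long i = 0; i < static_cast<long>(dims) * dims; ++i)
        if (m[i] > max_value)
            max_value = m[i];
    return max_value;
}
```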

Denis Yakovenko