
After I try to parallelize the code with OpenMP, the elements in the array are wrong; the order of the elements is not important. Or would it be more convenient to use a C++ std::vector instead of an array for parallelization? Could you suggest an easy way?

#include <stdio.h>
#include <math.h>

int main()
{
    int n = 100;
    int a[n*(n+1)/2]={0};
    int count=0;

    #pragma omp parallel for reduction(+:a,count)
    for (int i = 1; i <= n; i++) {
        for (int j = i + 1; j <= n; j++) {
            double k = sqrt(i * i + j * j);
            if (fabs(round(k) - k) < 1e-10) {
                a[count++] = i;
                a[count++] = j;
                a[count++] = (int) k;
            }
        }
    }
    
    for(int i=0;i<count;i++)
        printf("%d %s",a[i],(i+1)%3?"":", ");
    printf("\ncount: %d", count);
    return 0;
}

Original output:

3 4 5 , 5 12 13 , 6 8 10 , 7 24 25 , 8 15 17 , 9 12 15 , 9 40 41 , 10 24 26 , 11 60 61 , 12 16 20 , 12 35 37 , 13 84 85 , 14 48 50 , 15 20 25 , 15 36 39 , 16 30 34 , 16 63 65 , 18 24 30 , 18 80 82 , 20 21 29 , 20 48 52 , 20 99 101 , 21 28 35 , 21 72 75 , 24 32 40 , 24 45 51 , 24 70 74 , 25 60 65 , 27 36 45 , 28 45 53 , 28 96 100 , 30 40 50 , 30 72 78 , 32 60 68 , 33 44 55 , 33 56 65 , 35 84 91 , 36 48 60 , 36 77 85 , 39 52 65 , 39 80 89 , 40 42 58 , 40 75 85 , 40 96 104 , 42 56 70 , 45 60 75 , 48 55 73 , 48 64 80 , 48 90 102 , 51 68 85 , 54 72 90 , 56 90 106 , 57 76 95 , 60 63 87 , 60 80 100 , 60 91 109 , 63 84 105 , 65 72 97 , 66 88 110 , 69 92 115 , 72 96 120 , 75 100 125 , 80 84 116 ,
count: 189

After using OpenMP (gcc file.c -fopenmp):

411 538 679 , 344 609 711 , 354 533 649 , 218 387 449 , 225 475 534 , 182 283 339 , 81 161 182 , 74 190 204 , 77 138 159 , 79 176 195 , 18 24 30 , 18 80 82 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 ,
count: 189


3 Answers


Your threads are all accessing the shared count.

You would be better off eliminating count and having each loop iteration determine where to write its output based only on the (per-thread) values of i and j.
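For example (a minimal sketch of that idea, not code from the original answer): give every (i, j) pair its own fixed slot computed purely from the loop variables, then print and count in a serial pass afterwards. The triangular indexing formula and the zero sentinel are choices made here purely for illustration.

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    int const n = 100;
    long const npairs = 1L * n * (n - 1) / 2;
    // one slot of three ints per (i, j) pair; 0 marks "no triple here"
    // (safe as a sentinel because i starts at 1)
    std::vector<int> slots(3 * npairs, 0);

    #pragma omp parallel for schedule(dynamic)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            // fixed slot for this (i, j), derived only from the loop variables
            long const idx = 1L * (i - 1) * n - 1L * i * (i - 1) / 2 + (j - i - 1);
            double const k = std::sqrt(static_cast<double>(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10) {
                slots[3 * idx]     = i;
                slots[3 * idx + 1] = j;
                slots[3 * idx + 2] = static_cast<int>(std::round(k));
            }
        }
    }

    // serial pass: skip empty slots, print and count the triples found
    int count = 0;
    for (long s = 0; s < npairs; ++s) {
        if (slots[3 * s] != 0) {
            std::printf("%d %d %d , ", slots[3 * s], slots[3 * s + 1], slots[3 * s + 2]);
            count += 3;
        }
    }
    std::printf("\ncount: %d\n", count);
}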

Alternatively, use a vector to accumulate the results:

#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

#pragma omp declare                                                        \
    reduction(vec_append : std::vector<std::pair<int,int>> :               \
              omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

int main()
{
    constexpr int n = 100'000;
    std::vector<std::pair<int,int>> result;

#pragma omp parallel for \
            reduction(vec_append:result) \
            schedule(dynamic)
    for (int i = 1;  i <= n;  ++i) {
        for (int j = i + 1;  j <= n;  ++j) {
            auto const h2 = 1LL * i * i + 1LL * j * j; // hypotenuse squared (64-bit: i*i would overflow int for this n)
            auto const h = static_cast<long long>(std::sqrt(h2) + 0.5);   // integer square root
            if (h * h == h2) {
                result.emplace_back(i, j);
            }
        }
    }

    // for (auto const& v: result) {
    //     std::cout << v.first << ' '
    //               << v.second << ' '
    //               << std::hypot(v.first, v.second) << ' ';
    // }
    std::cout << "\ncount: " << result.size() << '\n';
}
Toby Speight
  • Not sure if this interacts in the right way with OpenMP's `reduction` clause, but at least in the serial version it would be clever to use `std::vector::reserve` here to avoid reallocations inside the loops. What certainly should work would be separating the `parallel` and the `for` regions and doing the `reserve` in between with `#pragma omp parallel private(result)`, right? (A sketch of this idea follows these comments.) – paleonix Jun 10 '21 at 15:08
  • It's hard to know how much to `reserve()`, unless there's some formula for estimating the number of Pythagorean triangles we expect to find. OP code is very conservative, assuming ⅓ of triples would be matches. And yes, we'd need to reserve both the private and shared versions of `result`, which is probably even harder to estimate. So I just ignored the issue and let the vector do its own thing. – Toby Speight Jun 10 '21 at 15:52
  • Fair enough. I guess one could reserve the maximum (`n * (n - 1) / 2`) for the shared version and divide that number by `omp_get_num_threads()` for the private ones, but one would have to benchmark whether it makes a significant, positive difference. – paleonix Jun 10 '21 at 15:59
  • Yes, would be worth benchmarking and comparing. FWIW, I get the result for `n = 50'000` in about 20 seconds, so performance is reasonable. – Toby Speight Jun 10 '21 at 16:00
  • While `std::vector::reserve()` doesn't seem to do much here, there are two other details that improved on your solution by almost a factor of 3 on my system: 1. use `schedule(dynamic)`, as the workload will be very unbalanced with the default static scheduling. 2. Don't use `std::hypot` here, as it will compute the squares in `double` format. Multiplying doubles is much more expensive than multiplying `int`s or even `long`s (which will be needed to prevent overflow). As this algorithm is not memory-bound (on my system), this kind of stuff is actually significant. – paleonix Jun 10 '21 at 23:10
  • I forgot that `std::hypot()` isn't overloaded for integers - good call. I was surprised that `schedule(dynamic)` makes a difference - I forgot that most of the time would be spent inside the `if` condition. I'll update the code with these changes. Thanks again! – Toby Speight Jun 11 '21 at 06:10
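A rough sketch of the per-thread reserve idea from the comments above (my own code; n and the sizing heuristic are illustrative). Note that OpenMP requires a reduction list item of a worksharing `for` to be shared in the enclosing parallel region, so instead of `private(result)` this sketch puts the reduction on the `parallel` directive; inside the region, `result` then refers to the thread-private copy and can be reserve()d before the loop:

#include <cmath>
#include <iostream>
#include <omp.h>
#include <utility>
#include <vector>

#pragma omp declare                                                        \
    reduction(vec_append : std::vector<std::pair<int,int>> :               \
              omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

int main()
{
    constexpr int n = 1'000;
    std::vector<std::pair<int,int>> result;

    #pragma omp parallel reduction(vec_append : result)
    {
        // crude guess from the comments above: total pair count / team size;
        // this over-reserves heavily, so benchmark before keeping it
        result.reserve(1L * n * (n - 1) / 2 / omp_get_num_threads());

        #pragma omp for schedule(dynamic) nowait
        for (int i = 1; i <= n; ++i) {
            for (int j = i + 1; j <= n; ++j) {
                auto const h2 = 1LL * i * i + 1LL * j * j;
                auto const h  = static_cast<long long>(std::sqrt(h2) + 0.5);
                if (h * h == h2)
                    result.emplace_back(i, j);
            }
        }
    }   // the private copies are combined into the shared result here

    std::cout << "count: " << result.size() << '\n';
}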

The count variable is an index into a. The reduction(+:a,count) clause sums the array elements; it is not the concatenation operation that I think you are looking for.

Access to the count variable needs to be protected by a mutex, something like #pragma omp critical, but I am not an OpenMP expert.

Alternatively, create int a[n][n], set all elements to -1 (a sentinel value to indicate "invalid"), then assign the result of the sqrt() when it is near enough to a whole number.
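A minimal sketch of that idea (my own illustration, using a std::vector of vectors in place of the C array a[n][n]): each (i, j) cell is owned by exactly one iteration, so no synchronization is needed, and a serial pass over the array afterwards prints and counts the valid entries.

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    int const n = 100;
    // a[i][j] holds the hypotenuse for the pair (i+1, j+1), or -1 ("invalid")
    std::vector<std::vector<int>> a(n, std::vector<int>(n, -1));

    #pragma omp parallel for schedule(dynamic)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10)
                a[i - 1][j - 1] = static_cast<int>(std::round(k));
        }
    }

    // serial pass: print and count the valid entries
    int count = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (a[i][j] != -1) {
                std::printf("%d %d %d , ", i + 1, j + 1, a[i][j]);
                count += 3;
            }
    std::printf("\ncount: %d\n", count);
}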

Daniel Dearlove

As an alternative to using a critical section, this solution uses atomics and could therefore be faster.

The following code might freeze your computer due to memory consumption. Be careful!

#include <cstdio>
#include <cmath>

#include <vector>

int main() {
    int const n = 100;
    // without a better (smaller) upper_bound this is extremely
    // wasteful in terms of memory for big n 
    long const upper_bound = 3L * static_cast<long>(n) *
                             (static_cast<long>(n) - 1L) / 2l; 
    std::vector<int> a(upper_bound, 0);
    int count = 0;

    #pragma omp parallel for schedule(dynamic) shared(a, count)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));

            if (std::fabs(std::round(k) - k) < 1e-10) {
                int my_pos;
                #pragma omp atomic capture
                my_pos = count++;

                a[3 * my_pos] = i;
                a[3 * my_pos + 1] = j;
                a[3 * my_pos + 2] = static_cast<int>(std::round(k));
            }
        }
    }
    count *= 3;
    
    for(int i = 0; i < count; ++i) {
        std::printf("%d %s", a[i], (i + 1) % 3 ? "" : ", ");
    }
    printf("\ncount: %d", count);

    return 0;
}

EDIT: My answer was initially a reaction to a now-deleted answer that used a critical section in a sub-optimal way. In the following I present another solution which combines a critical section with std::vector::emplace_back() to avoid the need for upper_bound, similar to Toby Speight's solution. In general, a reduction clause like the one in Toby Speight's solution should be preferred over critical sections and atomics, as reductions should scale better to large numbers of threads. In this particular case (relatively few results are written to a) and without a large number of cores to run on, the following code might still be preferable.

#include <cstdio>
#include <cmath>

#include <tuple>
#include <vector>

int main() {
    int const n = 100;

    std::vector<std::tuple<int, int, int>> a{};
    
    // optional, might reduce number of reallocations
    a.reserve(2 * n); // 2 * n is an arbitrary choice

    #pragma omp parallel for schedule(dynamic) shared(a)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));

            if (std::fabs(std::round(k) - k) < 1e-10) {
                #pragma omp critical
                a.emplace_back(i, j, static_cast<int>(std::round(k)));
            }
        }
    }
    long const count = 3L * static_cast<long>(a.size());
    
    for(unsigned long i = 0UL; i < a.size(); ++i) {
        std::printf("%d %d %d\n",
                    std::get<0>(a[i]), std::get<1>(a[i]), std::get<2>(a[i]));
    }
    printf("\ncount: %ld", count);

    return 0;
}
paleonix
  • 2 corrections: `std::vector<int> a(n, 0);` should be `std::vector<int> a(n*(n-1)/2, 0);`, and `mypos` should be `my_pos` in `a[3 * mypos + 2] = static_cast<int>(std::round(k));` – Laci Jun 10 '21 at 14:08
  • This version runs faster than @Toby Speight's one (n=20'000, g++ 10.2, -O3 -fopenmp). Results: real 0m0.231s vs. 0m0.267s, user 0m1.060s vs. 0m2.443s. – Laci Jun 10 '21 at 17:02
  • @Laci Even if this result were independent of hardware and `n`, it might still be different for the full code (vs minimal example) of the OP or any future reader. So to me it doesn't really matter which one is faster for the minimal example. – paleonix Jun 10 '21 at 17:16
  • OK. But there is a strange issue: for n=50'000 the counts are different: 226383 vs. 260250 (3*86750). Note that I have changed `int const n = 100;` to `size_t const n = 50000;`. Any idea why? – Laci Jun 10 '21 at 17:23
  • OK. I have figured it out: 50000*50000 is bigger than the maximum of int. – Laci Jun 10 '21 at 17:37