
After I try to parallelize the code with OpenMP, the elements in the array are wrong; the order of the elements is not important. Or would it be more convenient to use a C++ std::vector instead of an array for parallelization? Could you suggest an easy way?

#include <stdio.h>
#include <math.h>

int main()
{
    int n = 100;
    int a[n*(n+1)/2]={0};
    int count=0;

    #pragma omp parallel for reduction(+:a,count)
    for (int i = 1; i <= n; i++) {
        for (int j = i + 1; j <= n; j++) {
            double k = sqrt(i * i + j * j);
            if (fabs(round(k) - k) < 1e-10) {
                a[count++] = i;
                a[count++] = j;
                a[count++] = (int) k;
            }
        }
    }
    
    for(int i=0;i<count;i++)
        printf("%d %s",a[i],(i+1)%3?"":", ");
    printf("\ncount: %d", count);
    return 0;
}

Original output:

3 4 5 , 5 12 13 , 6 8 10 , 7 24 25 , 8 15 17 , 9 12 15 , 9 40 41 , 10 24 26 , 11 60 61 , 12 16 20 , 12 35 37 , 13 84 85 , 14 48 50 , 15 20 25 , 15 36 39 , 16 30 34 , 16 63 65 , 18 24 30 , 18 80 82 , 20 21 29 , 20 48 52 , 20 99 101 , 21 28 35 , 21 72 75 , 24 32 40 , 24 45 51 , 24 70 74 , 25 60 65 , 27 36 45 , 28 45 53 , 28 96 100 , 30 40 50 , 30 72 78 , 32 60 68 , 33 44 55 , 33 56 65 , 35 84 91 , 36 48 60 , 36 77 85 , 39 52 65 , 39 80 89 , 40 42 58 , 40 75 85 , 40 96 104 , 42 56 70 , 45 60 75 , 48 55 73 , 48 64 80 , 48 90 102 , 51 68 85 , 54 72 90 , 56 90 106 , 57 76 95 , 60 63 87 , 60 80 100 , 60 91 109 , 63 84 105 , 65 72 97 , 66 88 110 , 69 92 115 , 72 96 120 , 75 100 125 , 80 84 116 ,
count: 189

After using OpenMP (gcc file.c -fopenmp):

411 538 679 , 344 609 711 , 354 533 649 , 218 387 449 , 225 475 534 , 182 283 339 , 81 161 182 , 74 190 204 , 77 138 159 , 79 176 195 , 18 24 30 , 18 80 82 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 ,
count: 189


3 Answers


Your threads are all accessing the shared count.

You would be better off eliminating count and having each loop iteration determine where to write its output based only on the (per-thread) values of i and j.
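For example (a minimal sketch of that idea, not code from the original answer): give every (i, j) pair its own fixed slot computed purely from the loop variables, then print and count in a serial pass afterwards. The triangular indexing formula and the zero sentinel are choices made here purely for illustration.

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    int const n = 100;
    long const npairs = 1L * n * (n - 1) / 2;
    // one slot of three ints per (i, j) pair; 0 marks "no triple here"
    // (safe as a sentinel because i starts at 1)
    std::vector<int> slots(3 * npairs, 0);

    #pragma omp parallel for schedule(dynamic)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            // fixed slot for this (i, j), derived only from the loop variables
            long const idx = 1L * (i - 1) * n - 1L * i * (i - 1) / 2 + (j - i - 1);
            double const k = std::sqrt(static_cast<double>(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10) {
                slots[3 * idx]     = i;
                slots[3 * idx + 1] = j;
                slots[3 * idx + 2] = static_cast<int>(std::round(k));
            }
        }
    }

    // serial pass: skip empty slots, print and count the triples found
    int count = 0;
    for (long s = 0; s < npairs; ++s) {
        if (slots[3 * s] != 0) {
            std::printf("%d %d %d , ", slots[3 * s], slots[3 * s + 1], slots[3 * s + 2]);
            count += 3;
        }
    }
    std::printf("\ncount: %d\n", count);
}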

Alternatively, use a vector to accumulate the results:

#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

#pragma omp declare                                                        \
    reduction(vec_append : std::vector<std::pair<int,int>> :               \
              omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

int main()
{
    constexpr int n = 100'000;
    std::vector<std::pair<int,int>> result;

#pragma omp parallel for \
            reduction(vec_append:result) \
            schedule(dynamic)
    for (int i = 1;  i <= n;  ++i) {
        for (int j = i + 1;  j <= n;  ++j) {
            auto const h2 = 1LL * i * i + 1LL * j * j; // hypotenuse squared (64-bit: i*i would overflow int for this n)
            auto const h = static_cast<long long>(std::sqrt(h2) + 0.5);   // integer square root
            if (h * h == h2) {
                result.emplace_back(i, j);
            }
        }
    }

    // for (auto const& v: result) {
    //     std::cout << v.first << ' '
    //               << v.second << ' '
    //               << std::hypot(v.first, v.second) << ' ';
    // }
    std::cout << "\ncount: " << result.size() << '\n';
}
Toby Speight
  • Not sure if this interacts in the right way with OpenMP's `reduction` clause, but at least in the serial version it would be clever to use `std::vector::reserve` here to avoid reallocations inside the loops. What certainly should work would be separating the `parallel` and the `for` regions and doing the `reserve` in between with `#pragma omp parallel private(result)`, right? (A sketch of this idea follows these comments.) – paleonix Jun 10 '21 at 15:08
  • It's hard to know how much to `reserve()`, unless there's some formula for estimating the number of Pythagorean triangles we expect to find. OP code is very conservative, assuming ⅓ of triples would be matches. And yes, we'd need to reserve both the private and shared versions of `result`, which is probably even harder to estimate. So I just ignored the issue and let the vector do its own thing. – Toby Speight Jun 10 '21 at 15:52
  • Fair enough. I guess one could reserve the maximum (`n * (n - 1) / 2`) for the shared version and divide that number by `omp_get_num_threads()` for the private ones, but one would have to benchmark whether it makes a significant, positive difference. – paleonix Jun 10 '21 at 15:59
  • Yes, would be worth benchmarking and comparing. FWIW, I get the result for `n = 50'000` in about 20 seconds, so performance is reasonable. – Toby Speight Jun 10 '21 at 16:00
  • While `std::vector::reserve()` doesn't seem to do much here, there are two other details that improved on your solution by almost a factor of 3 on my system: 1. use `schedule(dynamic)`, as the workload will be very unbalanced with the default static scheduling. 2. Don't use `std::hypot` here, as it will compute the squares in `double` format. Multiplying doubles is much more expensive than multiplying `int`s or even `long`s (which will be needed to prevent overflow). As this algorithm is not memory-bound (on my system), this kind of stuff is actually significant. – paleonix Jun 10 '21 at 23:10
  • I forgot that `std::hypot()` isn't overloaded for integers - good call. I was surprised that `schedule(dynamic)` makes a difference - I forgot that most of the time would be spent inside the `if` condition. I'll update the code with these changes. Thanks again! – Toby Speight Jun 11 '21 at 06:10
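A rough sketch of the per-thread reserve idea from the comments above (my own code; n and the sizing heuristic are illustrative). Note that OpenMP requires a reduction list item of a worksharing `for` to be shared in the enclosing parallel region, so instead of `private(result)` this sketch puts the reduction on the `parallel` directive; inside the region, `result` then refers to the thread-private copy and can be reserve()d before the loop:

#include <cmath>
#include <iostream>
#include <omp.h>
#include <utility>
#include <vector>

#pragma omp declare                                                        \
    reduction(vec_append : std::vector<std::pair<int,int>> :               \
              omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

int main()
{
    constexpr int n = 1'000;
    std::vector<std::pair<int,int>> result;

    #pragma omp parallel reduction(vec_append : result)
    {
        // crude guess from the comments above: total pair count / team size;
        // this over-reserves heavily, so benchmark before keeping it
        result.reserve(1L * n * (n - 1) / 2 / omp_get_num_threads());

        #pragma omp for schedule(dynamic) nowait
        for (int i = 1; i <= n; ++i) {
            for (int j = i + 1; j <= n; ++j) {
                auto const h2 = 1LL * i * i + 1LL * j * j;
                auto const h  = static_cast<long long>(std::sqrt(h2) + 0.5);
                if (h * h == h2)
                    result.emplace_back(i, j);
            }
        }
    }   // the private copies are combined into the shared result here

    std::cout << "count: " << result.size() << '\n';
}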

The count variable is an index into a. The reduction(+:a,count) clause sums the array elements; it is not the concatenation operation that I think you are looking for.

Access to the count variable needs to be protected by a mutex, something like #pragma omp critical, but I am not an OpenMP expert.

Alternatively, create int a[n][n], set all elements to -1 (a sentinel value to indicate "invalid"), then assign the result of the sqrt() when it is near enough to a whole number.
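A minimal sketch of that idea (my own illustration, using a std::vector of vectors in place of the C array a[n][n]): each (i, j) cell is owned by exactly one iteration, so no synchronization is needed, and a serial pass over the array afterwards prints and counts the valid entries.

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    int const n = 100;
    // a[i][j] holds the hypotenuse for the pair (i+1, j+1), or -1 ("invalid")
    std::vector<std::vector<int>> a(n, std::vector<int>(n, -1));

    #pragma omp parallel for schedule(dynamic)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10)
                a[i - 1][j - 1] = static_cast<int>(std::round(k));
        }
    }

    // serial pass: print and count the valid entries
    int count = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (a[i][j] != -1) {
                std::printf("%d %d %d , ", i + 1, j + 1, a[i][j]);
                count += 3;
            }
    std::printf("\ncount: %d\n", count);
}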

Daniel Dearlove

As an alternative to using a critical section, this solution uses atomics and could therefore be faster.

The following code might freeze your computer due to memory consumption. Be careful!

#include <cstdio>
#include <cmath>

#include <vector>

int main() {
    int const n = 100;
    // without a better (smaller) upper_bound this is extremely
    // wasteful in terms of memory for big n 
    long const upper_bound = 3L * static_cast<long>(n) *
                             (static_cast<long>(n) - 1L) / 2l; 
    std::vector<int> a(upper_bound, 0);
    int count = 0;

    #pragma omp parallel for schedule(dynamic) shared(a, count)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));

            if (std::fabs(std::round(k) - k) < 1e-10) {
                int my_pos;
                #pragma omp atomic capture
                my_pos = count++;

                a[3 * my_pos] = i;
                a[3 * my_pos + 1] = j;
                a[3 * my_pos + 2] = static_cast<int>(std::round(k));
            }
        }
    }
    count *= 3;
    
    for(int i = 0; i < count; ++i) {
        std::printf("%d %s", a[i], (i + 1) % 3 ? "" : ", ");
    }
    printf("\ncount: %d", count);

    return 0;
}

EDIT: My answer was initially a reaction to a now-deleted answer that used a critical section in a sub-optimal way. In the following I present another solution which combines a critical section with std::vector::emplace_back() to avoid the need for upper_bound, similar to Toby Speight's solution. In general, a reduction clause like the one in Toby Speight's solution should be preferred over critical sections and atomics, as reductions should scale better to large numbers of threads. In this particular case (relatively few results are written to a) and without a large number of cores to run on, the following code might still be preferable.

#include <cstdio>
#include <cmath>

#include <tuple>
#include <vector>

int main() {
    int const n = 100;

    std::vector<std::tuple<int, int, int>> a{};
    
    // optional, might reduce number of reallocations
    a.reserve(2 * n); // 2 * n is an arbitrary choice

    #pragma omp parallel for schedule(dynamic) shared(a)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));

            if (std::fabs(std::round(k) - k) < 1e-10) {
                #pragma omp critical
                a.emplace_back(i, j, static_cast<int>(std::round(k)));
            }
        }
    }
    long const count = 3L * static_cast<long>(a.size());
    
    for(unsigned long i = 0UL; i < a.size(); ++i) {
        std::printf("%d %d %d\n",
                    std::get<0>(a[i]), std::get<1>(a[i]), std::get<2>(a[i]));
    }
    printf("\ncount: %ld", count);

    return 0;
}
paleonix
  • 2 corrections: `std::vector<int> a(n, 0);` should be `std::vector<int> a(n*(n-1)/2, 0);`, and `mypos` should be `my_pos` in `a[3 * mypos + 2] = static_cast<int>(std::round(k));` – Laci Jun 10 '21 at 14:08
  • This version runs faster than @Toby Speight's one (n=20'000, g++ 10.2, -O3 -fopenmp). Results: real 0m0.231s vs. 0m0.267s, user 0m1.060s vs. 0m2.443s. – Laci Jun 10 '21 at 17:02
  • @Laci Even if this result were independent of hardware and `n`, it might still be different for the full code (vs minimal example) of the OP or any future reader. So to me it doesn't really matter which one is faster for the minimal example. – paleonix Jun 10 '21 at 17:16
  • OK. But there is a strange issue: for n=50'000 the counts are different: 226383 vs. 260250 (3*86750). Note that I have changed `int const n = 100;` to `size_t const n = 50000;`. Any idea why? – Laci Jun 10 '21 at 17:23
  • OK. I have figured it out: 50000*50000 is bigger than the maximum of int. – Laci Jun 10 '21 at 17:37