I have the following C++ code that multiplies the elements of two large arrays (of size `count`):

double* pA1 = { large array };
double* pA2 = { large array };
for(register int r = mm; r <= count; ++r)
{
    lg += *pA1-- * *pA2--;  
}

Is there a way that I can implement parallelism for the code?

user1998863
  • [`std::reduce`](https://en.cppreference.com/w/cpp/algorithm/reduce). Use one of the versions that takes an `ExecutionPolicy`. – Pete Becker Jan 25 '22 at 14:55
  • `register int r = mm` [https://stackoverflow.com/a/3207129/487892](https://stackoverflow.com/a/3207129/487892) – drescherjm Jan 25 '22 at 14:55
  • `*pA1--` -> `*pA1++` ? – 463035818_is_not_an_ai Jan 25 '22 at 14:58
  • @PeteBecker `std::reduce` is for a single range; [`std::inner_product`](https://en.cppreference.com/w/cpp/algorithm/inner_product) does the trick, though. – Yksisarvinen Jan 25 '22 at 14:59
  • There are many ways: STL, OpenMP, TBB, other libraries, ... The right one will depend on the context, so you will have to be more specific to get a meaningful answer. – paleonix Jan 25 '22 at 15:00
  • @Yksisarvinen -- you're right, of course. – Pete Becker Jan 25 '22 at 16:04
  • Note that not all compilers support the parallel STL yet. For example, Clang does not support it. Also note that your loop will likely not be vectorized if you do not tune compiler flags (which should give up to a 4x performance boost on mainstream x64 machines and up to 8x on Intel servers). Otherwise the operation is mainly memory-bound, so using many threads will not be much faster (except on NUMA platforms). – Jérôme Richard Jan 25 '22 at 17:25
  • @Yksisarvinen, `std::inner_product` does not seem to have a parallel version, though. By the way, I am using Visual C++ 2022. Speed is the key here to calculate the product of two huge arrays of the same size. – user1998863 Jan 26 '22 at 02:12
  • AFAIK, MSVC should use TBB under the hood. You can do the product using a simple `std::transform`. The thing is, this is not very efficient due to a temporary container being filled, and AFAIK this is a fundamental issue of the STL unless you write your own iterator, which is very cumbersome. C++20 hopefully fixes this with (lazy) ranges. However, I am not sure ranges support the parallelism TS yet. – Jérôme Richard Jan 26 '22 at 09:40

2 Answers

Here is an alternative OpenMP implementation that is simpler (and a bit faster on many-core platforms):

double dot_prod_parallel(double* v1, double* v2, int dim)
{
    TimeMeasureHelper helper;
    double sum = 0.;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < dim; ++i)
        sum += v1[i] * v2[i];

    return sum;
}

GCC and ICC are able to vectorize this loop with -O3. Clang 13.0 fails to do so, even with -ffast-math, and even with explicit OpenMP SIMD directives or with loop tiling. This appears to be a bug in Clang's optimizer related to OpenMP... Note that you can use -mavx to target the AVX instruction set, which can be up to twice as fast as SSE (the default). It is available on almost all recent x86-64 PC processors.

Jérôme Richard
I wanted to answer my own question. It looks like we can use OpenMP as follows. However, the speed gain is not that large (about 2x); my computer has 16 cores.

// need to compile with /openmp (MSVC) or -fopenmp (GCC/Clang)
#include <omp.h>
#include <cstdio>
#include <iostream>
using std::cout;
using std::endl;

double dot_prod_parallel(double* v1, double* v2, int dim)
{
    TimeMeasureHelper helper;

    double sum = 0.;
#pragma omp parallel shared(sum)
    {
        int num = omp_get_num_threads();
        int id = omp_get_thread_num();
        printf("I am thread #%d of %d.\n", id, num);

        // Each thread accumulates into its own private partial sum.
        double priv_sum = 0.;
#pragma omp for
        for (int i = 0; i < dim; i++)
        {
            priv_sum += v1[i] * v2[i];
        }

        // Merge the per-thread partial sums one thread at a time.
#pragma omp critical
        {
            cout << "priv_sum = " << priv_sum << endl;
            sum += priv_sum;
        }
    }
    return sum;
}
user1998863
  • You do not need such complicated code with OpenMP. You can use a simple OpenMP parallel reduction. This is simpler to write (3x fewer lines), easier to read (just 1 pragma) and faster on many-core platforms (due to an optimized tree-based reduction instead of a critical section right after a slow implicit barrier). Note that, as said in the question comments, the code is not vectorized if you do not tell your compiler that double-FP addition is at least associative (some compilers require stricter assumptions). – Jérôme Richard Jan 27 '22 at 03:23
  • Jerome, can you post your code here? – user1998863 Jan 28 '22 at 00:37