
So I am quite new to OpenMP, but I've recently started playing with it and now I want to use it to speed up some calculations I need to do. I have a function that depends on 3 variables, and I want to evaluate it at a set of points in an interval and then write the values to a file. The thing is that the intervals for the 3 parameters are quite large, so there will be a lot of function evaluations, and the 3-nested for loop becomes quite painful when the interval is large.

The serial implementation is straightforward: a 3-nested loop where each index i, j, k takes the value of the corresponding parameter (integers from 1 to DIM), and the function is evaluated at the point (i, j, k). For the OpenMP approach I naturally thought of using #pragma omp parallel for, hoping that the runtime would be shorter.

Here is the code I wrote for the serial implementation and the "parallel" one. Please keep in mind that DIM (spaceDIM in the code) is set here to a small number just for testing purposes.

#include <iostream>
#include <chrono>
#include <omp.h>
#include <cmath>
#include <fstream>

using namespace std;

ofstream outparallel("../output/outputParallel.dat");
ofstream outserial("../output/outputSerial.dat");

const int spaceDIM = 80;

double myFunction(double a, double b, double c)
{
    return (double)a * log(b) + exp(a / b) + c;
}

void serialAlgorithmTripleLoop()
{
    auto sum = 0;
    auto timeStart = chrono::high_resolution_clock::now();
    for (int i = 1; i <= spaceDIM; ++i)
        for (int j = 1; j <= spaceDIM; ++j)
            for (int k = 1; k <= spaceDIM; ++k)
            {
                //sum += i * j * k;
                outserial << i << " " << j << " " << k << " " << myFunction((double)i, (double)j, (double)k) << endl;
            }
    auto timeStop = chrono::high_resolution_clock::now();
    auto execTime = chrono::duration_cast<chrono::seconds>(timeStop - timeStart).count();
    cout << "Serial execution time = " << execTime << " seconds";
    cout << endl;
    outserial << "Execution time = " << execTime << " seconds";
    outserial << endl;
}

void parallelAlgorithmTripleLoop()
{
    //start of the actual algorithm
    auto sum = 0;
    auto timeStart = chrono::high_resolution_clock::now();
#pragma omp parallel for
    for (int i = 1; i <= spaceDIM; ++i)
        for (int j = 1; j <= spaceDIM; ++j)
            for (int k = 1; k <= spaceDIM; ++k)
            {
                //   sum += i * j * k;
                outparallel << i << " " << j << " " << k << " " << myFunction((double)i, (double)j, (double)k) << endl;
            }
    auto timeStop = chrono::high_resolution_clock::now();
    auto execTime = chrono::duration_cast<chrono::seconds>(timeStop - timeStart).count();
    cout << "Parallel execution time = " << execTime << " seconds";
    cout << endl;
    outparallel << "Execution time = " << execTime << " seconds";
    outparallel << endl;
}

int main()
{
    cout << "FUNCTION OPTIMIZATION" << endl;
    serialAlgorithmTripleLoop();
    parallelAlgorithmTripleLoop();
    return 0;
}

The output is unexpected for me: using the parallel approach I get a longer execution time than with the serial one. I also tried the "reduction", "ordered", and "collapse" clauses from the OpenMP standard, but none of them helped (a sketch of the collapse/reduction variant I mean is shown below the timings). I'm running this on a 4-core / 8-thread laptop.

FUNCTION OPTIMIZATION
Serial execution time = 4 seconds
Parallel execution time = 7 seconds
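
For reference, this is roughly the shape of the pure-compute variant I mean when I mention the collapse/reduction clauses (a sketch of the idea only, not my exact code; the function name is just illustrative, it accumulates a sum and does no file output inside the loop):

double pureComputeSum()
{
    double sum = 0.0;
    // collapse(3) merges the three loops into one iteration space,
    // reduction(+ : sum) gives each thread its own private partial sum.
#pragma omp parallel for collapse(3) reduction(+ : sum)
    for (int i = 1; i <= spaceDIM; ++i)
        for (int j = 1; j <= spaceDIM; ++j)
            for (int k = 1; k <= spaceDIM; ++k)
                sum += myFunction((double)i, (double)j, (double)k);
    return sum;
}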

Q: How can I properly speed up the evaluation of the function?

  • *"I thought of using the #pragma omp parallel for hoping that the program runtime will be faster"* – why would you assume that? OpenMP is not some kind of magic wand that makes serial code work in parallel all of a sudden. This example code clearly spends most of its time doing I/O by writing into `ofstream`. It also likely suffers from UB when accessing the `ofstream` object without synchronization. – user7860670 Sep 24 '19 at 09:26
  • The problem here is that you write to the same file from all threads without any synchronization. At the very least, these writes will be interleaved. You may try to create a separate output file for each thread and then combine all these files together afterwards. With parallelization of the outermost loop and `schedule(static)`, this should preserve the order of the written records (lines). Even then, a speedup is not guaranteed, since all the threads use the same resource (the disk), which might not be fast enough for concurrent data storage. Also, don't use `std::endl` for each written line. – Daniel Langr Sep 24 '19 at 09:29
  • @DanielLangr thank you for the response. Would you mind explaining why one should not use `std::endl`? And do you think it would help to change the approach so that the function is evaluated in parallel and the results from each thread are appended to an array, making sure the `push_back` order is preserved (with `omp ordered`)? That way I could just write the array to the output file outside the parallel region (a sketch of that compute-then-write approach is at the end of this comment thread). – Robert Poenaru Sep 24 '19 at 11:12
  • @RobertPoenaru As for `std::endl`, see, e.g., here: [C++: “std::endl” vs “\n”](https://stackoverflow.com/questions/213907/c-stdendl-vs-n). And as for I/O, the problem is that the I/O part itself (writing the results to a file) might take the majority of the runtime. If so, you can parallelize the calculation itself, but it will not bring you any significant speedup. That's basically what [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law) says. Consider, for instance, that the sequential calculation takes 10 seconds and writing to the file takes 20 seconds: your speed-up then cannot exceed 1.5. – Daniel Langr Sep 24 '19 at 11:32
  • @RobertPoenaru Try measuring the runtime of your program with and without the file output and compare the numbers. You may find that the runtime of the calculations themselves is much shorter. If you do this experiment, just make sure the calculations are not optimized away completely (you can, e.g., sum the results of the function calls and print the sum at the end). I/O is typically very slow compared with CPU calculations, especially text-based I/O (as opposed to binary). – Daniel Langr Sep 24 '19 at 11:37
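
Below is a minimal sketch of the compute-then-write approach discussed in the comments: the parallel loop only fills a preallocated std::vector indexed by (i, j, k), all file output happens afterwards in a single thread using "\n" instead of std::endl, and the two phases are timed separately with omp_get_wtime. The file name and the index formula are illustrative, not taken from the original code.

#include <fstream>
#include <iostream>
#include <vector>
#include <cmath>
#include <omp.h>

const int spaceDIM = 80;

double myFunction(double a, double b, double c)
{
    return a * std::log(b) + std::exp(a / b) + c;
}

int main()
{
    const long long total = 1LL * spaceDIM * spaceDIM * spaceDIM;
    std::vector<double> results(total);

    double tCalc = omp_get_wtime();
    // Parallel phase: pure computation; each iteration writes to its own
    // element of 'results', so there is no shared stream and no data race.
#pragma omp parallel for collapse(3) schedule(static)
    for (int i = 1; i <= spaceDIM; ++i)
        for (int j = 1; j <= spaceDIM; ++j)
            for (int k = 1; k <= spaceDIM; ++k)
            {
                long long idx = (1LL * (i - 1) * spaceDIM + (j - 1)) * spaceDIM + (k - 1);
                results[idx] = myFunction((double)i, (double)j, (double)k);
            }
    tCalc = omp_get_wtime() - tCalc;

    double tWrite = omp_get_wtime();
    // Serial phase: write all results once, in order, using "\n" instead of
    // std::endl so the stream is not flushed after every single line.
    std::ofstream out("outputParallel.dat");
    for (int i = 1; i <= spaceDIM; ++i)
        for (int j = 1; j <= spaceDIM; ++j)
            for (int k = 1; k <= spaceDIM; ++k)
            {
                long long idx = (1LL * (i - 1) * spaceDIM + (j - 1)) * spaceDIM + (k - 1);
                out << i << " " << j << " " << k << " " << results[idx] << "\n";
            }
    tWrite = omp_get_wtime() - tWrite;

    std::cout << "Compute: " << tCalc << " s, write: " << tWrite << " s" << std::endl;
    return 0;
}

Comparing tCalc with tWrite is also the measurement suggested in the comments: if the write phase dominates, Amdahl's law limits the overall speed-up regardless of how many threads the computation uses.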

0 Answers