
Without using OpenMP directives (serial execution) - check screenshot here

Using OpenMP directives (parallel execution) - check screenshot here

#include "stdafx.h"
#include <omp.h>
#include <iostream>
#include <time.h>
using namespace std;

static long num_steps = 100000;
double step;
double pi;

int main()
{
    clock_t tStart = clock();
    int i;
    double x, sum = 0.0;
    step = 1.0 / (double)num_steps;

#pragma omp parallel for shared(sum)
    for (i = 0; i < num_steps; i++)
    {
        x = (i + 0.5) * step;
#pragma omp critical
        {
            sum += 4.0 / (1.0 + x * x);
        }
    }

    pi = step * sum;
    cout << pi << "\n";
    printf("Time taken: %.5fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
    getchar();
    return 0;
}

I have tried multiple times, and the serial execution is always faster. Why?

Serial Execution Time: 0.0200s
Parallel Execution Time: 0.02500s

Why is the serial execution faster here? Am I calculating the execution time in the right way?

Raghav venkat
  • Remember it takes time to create threads and your algorithm does not take that much time. – drescherjm Apr 20 '18 at 19:52
  • Oh, is it so? Thanks! – Raghav venkat Apr 20 '18 at 19:55
  • Lots of things can lead to parallel execution being slower than non-parallel. For example: the overhead of spinning up threads outweighs the work done in each thread. The cost of synchronization outweighs the benefit of running in parallel. False sharing (due to a bad/ignorant implementation) kills the performance of the threaded version. And much, much more. Threading is *hard*, *not* a panacea. – Jesper Juhl Apr 20 '18 at 19:55
  • Point to be noticed! Thanks for the clarification. – Raghav venkat Apr 20 '18 at 19:57
  • To answer your question: no you are not timing the execution time correctly. See https://stackoverflow.com/questions/13351396/c-timing-in-linux-using-clock-is-out-of-sync-due-to-openmp and several others for reasons why not to use `clock` to time parallel programs. – High Performance Mark Apr 20 '18 at 20:01
  • Possible duplicate of [OpenMP time and clock() calculates two different results](https://stackoverflow.com/questions/10673732/openmp-time-and-clock-calculates-two-different-results) – Zulan Apr 20 '18 at 22:56
  • @Zulan, `clock()` is not likely the issue because the OP is using `stdafx.h` which is [precompiled header from Visual Studio](https://en.wikipedia.org/wiki/Precompiled_header#Microsoft_Visual_C_and_C++) and `clock()` does not have this problem with the MSVC runtime. It's only with Linux variants of the C library that have this issue. – Z boson Apr 21 '18 at 07:06
  • @Zboson practically you are right. However, if you focus on *"am I calculating the execution time in the right way?"*, I think the duplicate is justified. – Zulan Apr 21 '18 at 09:15
  • Get rid of the critical pragma and replace `shared(sum)` with `reduction(+:sum) private(x)` – Z boson Apr 22 '18 at 08:08

2 Answers


OpenMP implements parallel processing with multithreading internally, and the benefit of multithreading only becomes measurable with a large volume of data. With a very small volume of data you cannot meaningfully measure the performance of a multithreaded application. The reasons:

a) To create a thread, the OS needs to allocate memory for each thread, which takes time (even if only a tiny bit).

b) When you create multiple threads, they require context switching, which also takes time.

c) The memory allocated to the threads must be released, which also takes time.

d) The speedup also depends on the number of processors and the total memory (RAM) in your machine.

So when you run a small operation with multiple threads, its performance will be about the same as a single thread's (by default the OS assigns one thread to every process, called the main thread). So your outcome is expected in this case. To measure the performance of a multithreaded architecture, use a large amount of data with a complex operation; only then will you see the difference, as the sketch below illustrates.
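To make that concrete, here is a minimal sketch (the workload size is a hypothetical choice, and it uses the reduction fix suggested in the comments instead of the critical section) showing how a larger workload, timed with omp_get_wtime() (wall-clock time), makes the parallel speedup measurable:

#include <omp.h>
#include <cstdio>

int main()
{
    // Hypothetical: 100x the original workload, so thread start-up cost is amortized.
    const long num_steps = 10000000;
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    double t0 = omp_get_wtime();             // wall-clock time, unlike clock()
#pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++)
    {
        double x = (i + 0.5) * step;         // declared inside the loop, so private
        sum += 4.0 / (1.0 + x * x);
    }
    double t1 = omp_get_wtime();

    printf("pi = %.10f, time taken: %.5fs\n", step * sum, t1 - t0);
    return 0;
}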

Abhijit Pritam Dutta
  • Well Explained! Thanks! – Raghav venkat Apr 21 '18 at 06:49
  • Using a critical for most of the computation, you aren't taking advantage of multiple threads. You appear to have a sufficiently long loop for an effective omp parallel reduction. There must be several examples of this particular exercise posted online. – tim18 Apr 21 '18 at 11:30

Because of your critical block, you cannot accumulate into sum in parallel. Every time one thread reaches the critical section, all other threads have to wait.

The smart approach is to create a temporary copy of sum for each thread that can be accumulated without synchronization, and afterwards to sum the results from the different threads. OpenMP can do this automatically with the reduction clause, so your loop becomes:

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < num_steps; i++)
{
    x = (i + 0.5)*step;
    sum += 4.0 / (1.0 + x * x);
}

On my machine this performs 10 times faster than the version using the critical block (I also increased num_steps to reduce the influence of one-time costs like thread creation).
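For intuition, here is a rough hand-rolled sketch of what reduction(+:sum) effectively does behind the scenes (illustrative only, reusing num_steps and step from the question; in practice just use the clause):

double sum = 0.0;
#pragma omp parallel
{
    double local_sum = 0.0;                  // each thread owns a private partial sum
#pragma omp for
    for (long i = 0; i < num_steps; i++)
    {
        double x = (i + 0.5) * step;
        local_sum += 4.0 / (1.0 + x * x);    // no synchronization in the hot loop
    }
#pragma omp atomic
    sum += local_sum;                        // one synchronized add per thread
}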

PS: I recommend using <chrono>, <boost/timer/timer.hpp>, or Google Benchmark for timing your code.
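For example, a minimal <chrono> sketch; std::chrono::steady_clock measures monotonic wall-clock time, whereas clock() on some C libraries (e.g. glibc) sums CPU time across all threads:

#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::steady_clock::now();   // monotonic wall-clock

    // ... run the parallel loop from above here ...

    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Time taken: " << elapsed.count() << "s\n";
    return 0;
}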

sv90
  • Add `private(x)` as well since `x` is defined outside of the parallel region it is shared unless you explicitly declare it private. – Z boson Apr 23 '18 at 06:30
  • For timing OpenMP omp_get_wtime() is a simple approach (which doesn't need any other headers or code beyond what you must already have) – Jim Cownie Apr 23 '18 at 08:50