I was trying to parallelize my code, but parallelization only degraded its performance. I wrote a Fortran code which runs several Monte Carlo integrations and then finds their mean.

      implicit none  
      integer, parameter :: n=100
      integer, parameter :: m=1000000
      real, parameter :: pi=3.141592654

      real MC,integ,x,y
      integer tid,OMP_GET_THREAD_NUM,i,j,init,inside  
      read*,init
      call OMP_SET_NUM_THREADS(init)
      integ=0.0
      call random_seed()
!$OMP PARALLEL DO PRIVATE(J,X,Y,INSIDE,MC) 
!$OMP& REDUCTION(+:INTEG)
      do i=1,n
         inside=0
         do j=1,m
           call random_number(x)
           call random_number(y)

           x=x*pi
           y=y*2.0

           if(y.le.x*sin(x))then
             inside=inside+1
           endif

         enddo

         MC=inside*2*pi/m
         integ=integ+MC/n
      enddo
!$OMP END PARALLEL DO

      print*, integ
      end

As I increase the number of threads, the run time increases drastically. I have looked for solutions to such problems, and in most cases shared variables turn out to be the culprit, but I cannot see how that applies to my case.

I am running it on a 16-core processor using the Intel Fortran compiler.

EDIT: The program after adding implicit none, declaring all variables and adding the private clause

  • 1
    Welcome. Please take the [Tour], it is recommended for all newcomers. How did you measure the time? – Vladimir F Героям слава Jun 02 '17 at 13:43
  • Please see how I changed the indentation in your code. It is good to post all your code indented so that people can see the structure. It is good for you to keep your source files that way too. – Vladimir F Героям слава Jun 02 '17 at 13:45
  • What is `inside`? Where does it come from? Please make sure you use `IMPLICIT NONE`. – Vladimir F Героям слава Jun 02 '17 at 13:47
  • @VladimirF Thank you for the edit and suggestions. The program took a few seconds to run with a single thread and more than a minute using multiple threads so I did not use any inbuilt functions to measure time. `inside` is the number of random points which lie under the curve `x*sin(x)` where x goes from 0 to pi. The total area in which `m` random points are generated is 2*pi so the area under the curve is `inside*(2*pi)/m` which is stored in `MC` – Vishal Singh Jun 03 '17 at 13:55

1 Answer


You should not use RANDOM_NUMBER for high performance computing, and definitely not from parallel threads. There are NO guarantees about the quality of the standard random number generator, nor about its thread safety. See Can Random Number Generator of Fortran 90 be trusted for Monte Carlo Integration?

Some compilers use a fast algorithm that cannot be called in parallel. Some have a slow method that is callable from parallel code. Some are both fast and safe to call in parallel. Some generate poor-quality random sequences, some better.

You should use a parallel PRNG library; there are many. See https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/283349 for recommendations for Intel. I use a library based on http://www.cmiss.org/openCMISS/wiki/RandomNumberGenerationWithOpenMP in my own slightly improved version https://bitbucket.org/LadaF/elmm/src/e732cb9bee3352877d09ae7f6b6722157a819f2c/src/simplevtk.f90?at=master&fileviewer=file-view-default but be careful: in my applications I only care about speed, not about the quality of the sequence.
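For illustration, a minimal sketch of the idea behind such libraries (the module name, the seeds, and the mapping to (0,1) are my own ad-hoc choices, not taken from the links above): give each thread its own generator state, so no thread ever touches another thread's state and no synchronization is needed.

```fortran
! A minimal sketch, NOT the linked library: one xorshift64 state per
! thread, so threads never share mutable state.  Seeding and the
! mapping to (0,1) are ad hoc; the sequence quality is modest.
module thread_rng
  implicit none
  integer(8), allocatable :: state(:)
contains
  subroutine rng_init(nthreads)
    integer, intent(in) :: nthreads
    integer :: t
    allocate(state(0:nthreads-1))
    do t = 0, nthreads - 1
      ! distinct nonzero seed per thread (arbitrary odd constants)
      state(t) = 88172645463325252_8 + 2654435761_8 * t
    end do
  end subroutine rng_init

  ! call with tid = OMP_GET_THREAD_NUM()
  function rng_uniform(tid) result(r)
    integer, intent(in) :: tid
    real :: r
    integer(8) :: s
    s = state(tid)
    s = ieor(s, ishft(s, 13))
    s = ieor(s, ishft(s, -7))
    s = ieor(s, ishft(s, 17))
    state(tid) = s
    ! map the signed 64-bit value roughly onto (0,1)
    r = 0.5 + real(s) * 5.421010862427522e-20
  end function rng_uniform
end module thread_rng
```

Each thread would then call `rng_uniform(OMP_GET_THREAD_NUM())` instead of `RANDOM_NUMBER`. Note that neighbouring elements of `state` sit on the same cache line, so a serious implementation would pad each state (or keep it in a `THREADPRIVATE` variable) to avoid false sharing, and would use proper seeding.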


To the old version:

You have a race condition there.

With

 inside=inside+1

more threads can compete to read and write the variable, so you will have to synchronize the access somehow. If you make it a reduction variable, you will have problems with

integ=integ+MC/n

and if you make it private, then `inside=inside+1` will only count locally.

MC also appears to be in a race condition, because more threads will be writing to it. It is not clear at all what MC does and why it is there, because you are not using the value anywhere. Are you sure the code you show is complete? If not, please see How to make a Minimal, Complete, and Verifiable example.
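A minimal sketch of one way to resolve both races at once (essentially what the edited question now does): keep `inside`, `MC`, and the inner-loop temporaries `PRIVATE`, initialize `integ`, and let a `REDUCTION` clause combine the per-thread contributions. The thread safety of `RANDOM_NUMBER` itself remains a separate issue, as discussed above.

```fortran
      integ = 0.0
!$OMP PARALLEL DO PRIVATE(J,X,Y,INSIDE,MC)
!$OMP& REDUCTION(+:INTEG)
      do i=1,n
         inside=0
         do j=1,m
           call random_number(x)
           call random_number(y)
           x=x*pi
           y=y*2.0
           if (y.le.x*sin(x)) inside=inside+1
         enddo
         MC=inside*2*pi/m
!        each thread adds only to its own private copy of integ;
!        the runtime sums the copies at the end of the construct
         integ=integ+MC/n
      enddo
!$OMP END PARALLEL DO
```

With the reduction in place there is no shared variable written inside the loop, so no synchronization is needed and no cache line ping-pongs between threads.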

See With OpenMP parallelized nested loops run slow and many other examples of how a race condition can make a program slow.

  • `inside` should, by default, be a private variable. I want the outer loop to run in parallel such that the inner loop runs independently in different threads. `MC` stores the value of the integration. `MC` is calculated n times independently and the mean is stored in `integ`. I also tried removing `integ` so that everything is private, just to see the time difference, but it had no impact on the run time. – Vishal Singh Jun 03 '17 at 14:11
  • Why do you think it is private by default? You have no such setting in your code. By default everything is shared. I strongly believe the program is slow because of the race condition. Did you compare the results of the serial and parallel code? Are they the same, or do they differ? I bet they will differ. – Vladimir F Героям слава Jun 03 '17 at 14:22
  • I tried adding `DEFAULT(PRIVATE)` but it did not have any effect (that's why I thought it must be private by default, my bad). Surprisingly, the results are the same for both the serial and parallel code, with or without `DEFAULT(PRIVATE)`. Also the result is correct up to 3 decimal places. I also tried declaring all the variables private using the `PRIVATE` clause, but that also did not help. – Vishal Singh Jun 03 '17 at 17:04
  • Without a [mcve] no one can help you. It must be **compilable** and **complete**. – Vladimir F Героям слава Jun 03 '17 at 18:40
  • Although it was inspired by a problem in a bigger code, I am compiling this code itself using ifort without any issues. I accept that it is poorly written, which might make it difficult to find the problem. I would also like to ask: even though every iteration took almost one second on my machine using a single thread, could it be possible that there is some overhead problem? – Vishal Singh Jun 04 '17 at 00:51
  • If the code you show is the code you run then my answer is valid. Please read my answer. If you want anything more then 1. The code **MUST** contain `implicit none`, 2. All variables **MUST** be declared. – Vladimir F Героям слава Jun 04 '17 at 04:18
  • I tried adding `implicit none` and declaring all variables but there is no improvement in the performance. I also tried running the program given here http://jblevins.org/log/openmp but experienced the same problem – Vishal Singh Jun 14 '17 at 07:18
  • Please read the http://jblevins.org/log/openmp *very carefully*. What I see there is: *"Depending on your computer, because of the simplicity of this experiment you will probably need to set `n` very high in order to see the difference in execution speed. In fact, for small values of `n` (even 100000 in my case) the parallel code is actually **slower**."* – Vladimir F Героям слава Jun 14 '17 at 08:15
  • If you want anything more about **your code**, then you **MUST** edit the question and **show the complete** new **code** you are testing. – Vladimir F Героям слава Jun 14 '17 at 08:17
  • I set `n` to 1000000 such that the code took a few seconds to execute while running sequentially, but it took much more time when run in parallel – Vishal Singh Jun 16 '17 at 15:48