4

I'm quite new to Fortran and OpenMP, but I'm trying to get my bearings. I have a piece of code for calculating variograms which I'm attempting to parallelize. However, I seem to be getting race conditions, as some of the results are off by a thousandth or so.

The problem seems to be the reductions. Using OpenMP reductions works and gives the correct results, but that approach is not desirable, because the reductions actually happen in another subroutine (I copied the relevant lines into the OpenMP loop for this test). I therefore put the reductions inside a CRITICAL section instead, but without success. Interestingly, the problem only occurs for reals, not integers. I have considered whether the order of the additions makes any difference, but it should not produce errors this big.

Just to check, I put everything in the parallel do in an ORDERED block, which (of course) gave the correct results (albeit without any speedup). I also tried putting everything inside a CRITICAL section, but for some reason that did not give the correct results. My understanding is that OpenMP will flush the shared variables upon entering/exiting CRITICAL sections, so there shouldn't be any cache problems.

So my question is: why doesn't a critical section work in this case?

My code is below. All shared variables except np, tm, hm, gam are read-only.

EDIT: I tried to simulate the randomness induced by multiple threads by replacing the do loops with random integers in the same range (i.e. generate a pair i,j in the range of the loops; if that pair has already been visited, generate a new one), and to my surprise the results matched. However, upon further inspection it was revealed that I had forgotten to seed the RNG, and the results were correct only by coincidence. How embarrassing!

TL;DR: The discrepancies in the results were caused by the ordering of the floating-point additions. Using double precision instead helps.
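For concreteness, a minimal sketch of the kind of change that helped (the kind constant dp and these declarations are only illustrative, not copied from my actual code; the real accumulators are the tm, hm and gam arrays in the snippet below):

INTEGER, PARAMETER :: dp = KIND(1.0D0)   ! double-precision kind (illustrative name)
REAL(KIND=dp), ALLOCATABLE :: tm(:,:,:,:), hm(:,:,:,:), gam(:,:,:,:)   ! accumulators promoted to double precision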

!$OMP PARALLEL DEFAULT(none) SHARED(nd, x, y, z, nzlag, nylag, nxlag, &
!$OMP& dzlag, dylag, dxlag, nvarg, ivhead, ivtail, ivtype, vr, tmin, tmax, np, tm, hm, gam) num_threads(512)
!$OMP DO PRIVATE(i,j,zdis,ydis,xdis,izl,iyl,ixl,indx,vrh,vrt,vrhpr,vrtpr,variogram_type) !reduction(+:np, tm, hm, gam)
  DO i=1,nd        
!$OMP CRITICAL (main)
! Second loop over the data:
    DO j=1,nd

! The lag:
      zdis = z(j) - z(i)
      IF(zdis >= 0.0) THEN
        izl =  INT( zdis/dzlag+0.5)
      ELSE
        izl = -INT(-zdis/dzlag+0.5)
      END IF
 ! ---- SNIP ----

! Loop over all variograms for this lag:

      DO cur_variogram=1,nvarg
        variogram_type = ivtype(cur_variogram)

! Get the head and tail values:

        indx = i+(ivhead(cur_variogram)-1)*maxdim
        vrh   = vr(indx)
        indx = j+(ivtail(cur_variogram)-1)*maxdim
        vrt   = vr(indx)
        IF(vrh < tmin.OR.vrh >= tmax.OR. vrt < tmin.OR.vrt >= tmax) CYCLE

        ! ----- PROBLEM AREA -------
        np(ixl,iyl,izl,1)  = np(ixl,iyl,izl,1) + 1.   ! <-- This never fails
        tm(ixl,iyl,izl,1)  = tm(ixl,iyl,izl,1) + vrt  
        hm(ixl,iyl,izl,1)  = hm(ixl,iyl,izl,1) + vrh
        gam(ixl,iyl,izl,1) = gam(ixl,iyl,izl,1) + ((vrh-vrt)*(vrh-vrt))
        ! ----- END OF PROBLEM AREA -----

        !CALL updtvarg(ixl,iyl,izl,cur_variogram,variogram_type,vrt,vrh,vrtpr,vrhpr)
      END DO
    END DO
    !$OMP END CRITICAL (main)
  END DO
!$OMP END DO
!$OMP END PARALLEL

Thanks very much in advance!

Painless
  • I would suggest trying to post a [SSCCE](http://sscce.org/) instead of an incomplete snippet. This will force you to analyze better what actually causes the problem and will give other people the possibility to reproduce it in their own environment. – Massimiliano Jan 08 '14 at 11:05
  • *some of the results are off by a thousandth or so* Is that a measure of absolute error or of relative error ? If the former, what is the magnitude of the latter ? – High Performance Mark Jan 08 '14 at 12:24
  • @Massimiliano Normally I would agree with you, however, it is not possible to provide a self-contained example in this case due to initialization and data in general. – Painless Jan 08 '14 at 12:28
  • @HighPerformanceMark The error I reported was absolute. The relative error is very small (getting results such as 84.26539 vs 84.26538). However, the absolute error still seems way too big and I can't dismiss it. I also tested the ordering of the operations by randomly walking through the number range of the do loops and it produced exact results. I also doubt there's any SIMD magic behind this. – Painless Jan 08 '14 at 12:49
  • Use a real kind with larger precision and see what it does. – Vladimir F Героям слава Jan 08 '14 at 13:36
  • @VladimirF Yes, that improves the results considerably, so thanks for the suggestion! – Painless Jan 08 '14 at 14:16
  • Maybe you can try and preserve the order with OpenMP and Fortran. You basically need to make an array with one slot per thread, write the partial sums from each thread into it in parallel, and then sum over the per-thread partial sums in serial. – Z boson Jan 08 '14 at 14:33

2 Answers

3

If you are using 32-bit floating-point numbers and arithmetic, the difference between 84.26539 and 84.26538 (a difference of 1 in the least significant digit) is entirely explicable by the non-determinism of parallel floating-point arithmetic. Bear in mind that a 32-bit floating-point number only has about 7 decimal digits to play with.

Ordinary floating-point arithmetic is not strictly associative. For real (in the mathematical, not the Fortran, sense) numbers (a+b)+c == a+(b+c), but there is no such rule for floating-point numbers. This is nicely explained in the Wikipedia article on floating-point arithmetic.

The non-determinism arises because, in using OpenMP you surrender control over the ordering of operations to the run-time. A summation of values across threads (such as a reduction on +) leaves the bracketing of the global sum expression to the run-time. It is not even necessarily true that 2 executions of the same OpenMP program will produce the same-to-the-last-bit results.

I suspect that even running an OpenMP program on one thread may produce different results from the equivalent non-OpenMP program. Since knowledge of the number of threads available to an OpenMP executable may be deferred until run-time the compiler will have to create a parallelised executable whether it is eventually run in parallel or not.
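By way of illustration, here is a minimal sketch (not the questioner's code; all names are made up) that typically shows the serial and reduced sums differing in the last decimal digit or so, because the reduction changes the bracketing of the sum:

PROGRAM reduction_order
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1000000
  REAL    :: x(n), serial_sum, parallel_sum
  INTEGER :: i

  CALL RANDOM_NUMBER(x)

  ! Serial sum: one fixed left-to-right bracketing.
  serial_sum = 0.0
  DO i = 1, n
    serial_sum = serial_sum + x(i)
  END DO

  ! OpenMP reduction: the run-time chooses how the partial sums are combined.
  parallel_sum = 0.0
!$OMP PARALLEL DO REDUCTION(+:parallel_sum)
  DO i = 1, n
    parallel_sum = parallel_sum + x(i)
  END DO
!$OMP END PARALLEL DO

  PRINT *, 'serial:   ', serial_sum
  PRINT *, 'parallel: ', parallel_sum
END PROGRAM reduction_order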

High Performance Mark
  • That was my original thought, but as I said in my previous comment, I walked through the two do loops randomly (that is: I randomized two numbers i,j between 1 and nd; if I had "visited" them before, re-roll; continue until all nd*nd combinations are visited) and it produced the exact result. Also, the OpenMP reduce statement (commented out in the posted snippet) should also produce different results every time, but does not. – Painless Jan 08 '14 at 13:25
  • I wouldn't agree that *the OpenMP reduce statement ... should also produce different results every time* but would agree that *the OpenMP reduce statement might not produce the same results every time*. An inspection of your code does not lead me to any other conclusions or suggestions. – High Performance Mark Jan 08 '14 at 13:33
  • That's interesting. I had not thought about the associativity of floating-point operations in OpenMP and how it can affect the results. – Z boson Jan 08 '14 at 13:49
  • Hmm, it does seem like the whole thing is messed up by the ordering, at least when using OpenMP, but it still does not explain the results in my own experiment where I randomly iterated through the indices. – Painless Jan 08 '14 at 14:20
  • Never mind, I'm an idiot; I didn't realize init_seed() automatically generated a seed, so it just happened that the results were correct for the default seed. Feeling a bit embarrassed right now B-) – Painless Jan 08 '14 at 14:30
2

High Performance Mark makes an interesting point about floating point and associativity. This can easily be tested (in C).

#include <stdio.h>
int main(void) {
    float a = -1.0E8f, b = 1.0E8f, c = 1.23456f;
    printf("sum %f\n", (a+b)+c); // output: 1.234560
    printf("sum %f\n", a+(b+c)); // output: 0.000000
}

But I would like to point out that it is possible to preserve the order in OpenMP. I discussed this here: C++ OpenMP: Split for loop in even chunks static and join data at the end

Edit:

Actually, I confused commutativity and associativity. If you have an operator which is associative but not commutative, then it's possible to preserve the order with OpenMP as I did in the post above. However, IEEE floating point is commutative but NOT associative, so the only thing that can be done is to break IEEE and let it be treated as associative.
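For reference, here is a minimal sketch of the per-thread partial-sum idea from my comment above (all names are illustrative). It makes the result reproducible from run to run for a fixed thread count, because the per-thread sums are combined serially in thread-id order, but as explained above it still does not reproduce the purely serial sum bit-for-bit:

PROGRAM partial_sums
  USE omp_lib
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1000000
  REAL :: x(n), total
  REAL, ALLOCATABLE :: partial(:)
  INTEGER :: i, tid, nthreads

  CALL RANDOM_NUMBER(x)

!$OMP PARALLEL PRIVATE(tid)
!$OMP SINGLE
  nthreads = omp_get_num_threads()
  ALLOCATE(partial(0:nthreads-1))
  partial = 0.0
!$OMP END SINGLE
  tid = omp_get_thread_num()
  ! Static schedule: each thread always gets the same chunk of iterations,
  ! and each thread only ever writes to its own slot of partial().
!$OMP DO SCHEDULE(static)
  DO i = 1, n
    partial(tid) = partial(tid) + x(i)
  END DO
!$OMP END DO
!$OMP END PARALLEL

  ! Combine the per-thread partial sums serially, in a fixed order.
  total = 0.0
  DO i = 0, nthreads - 1
    total = total + partial(i)
  END DO
  PRINT *, 'total: ', total
END PROGRAM partial_sums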

Z boson
  • That's a pretty nifty idea, I'll make sure to use it if I can! – Painless Jan 08 '14 at 14:32
  • @Painless, I made a mistake. I confused associativity with commutativity. The trick I used in that link was for an operator which was not commutative but was still associative. It does not work the other way around (which is what IEEE floating point is). I updated my answer. – Z boson Jan 13 '14 at 20:03