
I am aware of this and this, but I ask again as the first link is pretty old now, and the second link did not seem to reach a conclusive answer. Has any consensus developed?

My problem is simple:

I have a DO loop that has elements that may be run concurrently. Which method do I use?

Below is code to generate particles on a simple cubic lattice.

  • npart is the number of particles
  • npart_edge and npart_face are the number of particles along an edge and on a face, respectively
  • space is the lattice spacing
  • Rx, Ry, Rz are position arrays
  • x, y, z are temporary variables used to compute the position on the lattice

Note the difference: x, y and z have to be arrays in the DO CONCURRENT case, but not in the OpenMP case, where they can be declared PRIVATE.

So do I use DO CONCURRENT (which, as I understand from the links above, uses SIMD):

DO CONCURRENT (i = 1:npart)
    x(i) = MODULO(i-1, npart_edge)
    Rx(i) = space*x(i)
    y(i) = MODULO( ( (i-1) / npart_edge ), npart_edge)
    Ry(i) = space*y(i)
    z(i) = (i-1) / npart_face
    Rz(i) = space*z(i)
END DO

Or do I use OpenMP?

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(x,y,z)
!$OMP DO
DO i = 1, npart
    x = MODULO(i-1, npart_edge)
    Rx(i) = space*x
    y = MODULO( ( (i-1) / npart_edge ), npart_edge)
    Ry(i) = space*y
    z = (i-1) / npart_face
    Rz(i) = space*z
END DO
!$OMP END DO
!$OMP END PARALLEL

My tests:

Placing 64 particles in a box of side 10:

$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out 
CPU time =  6.870000000000001E-003
Real time =  3.600000000000000E-003

$ ifort -real-size 64 concurrent.f90 
$ ./a.out 
CPU time =  6.699999999999979E-005
Real time =  0.000000000000000E+000

Placing 100000 particles in a box of side 100:

$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out 
CPU time =  8.213300000000000E-002
Real time =  1.280000000000000E-002

$ ifort -real-size 64 concurrent.f90 
$ ./a.out 
CPU time =  2.385000000000000E-003
Real time =  2.400000000000000E-003

Using the DO CONCURRENT construct seems to give me at least an order of magnitude better performance. This was done on an i7-4790K. Also, the advantage of DO CONCURRENT seems to decrease with increasing problem size.

physkets
  • The assertion about x, y and z needing to be arrays in the DO CONCURRENT case is not a language requirement. How DO CONCURRENT is implemented also depends very much on compiler capability - lazy compilers may just implement it as an ordinary serial loop without any vectorization or parallelization. So the answer is... "It depends." – IanH Jul 24 '16 at 07:58
  • @IanH What do you mean when you say that it is not a language requirement? I say that they have to be arrays because otherwise, those operations cannot be done concurrently. Also, I've added as an edit, info on performance. – physkets Jul 24 '16 at 08:38
  • So what is the question now? DO CONCURRENT vs. OpenMP in general? Or how do I make this piece of code run faster? These are two VERY different questions. The general answer is to use what is faster, of course. – Vladimir F Героям слава Jul 24 '16 at 08:45
  • And I can't see why you are using REDUCTION at all. – Vladimir F Героям слава Jul 24 '16 at 08:48
  • And 64 particles is extremely small. For threads you should have 10000 or 1000000. Starting a thread just for 8 particles is a joke. You should time the relevant loop only, too. – Vladimir F Героям слава Jul 24 '16 at 08:50
  • The question is to understand what circumstances favour one over the other. Oh, true, you're right. REDUCTION is not necessary here; let me check with it removed. I am only timing the relevant loop. – physkets Jul 24 '16 at 08:56
  • The semantics of DO CONCURRENT do not require those variables to be arrays. A variable may be referenced in an iteration if it is defined previously in the same iteration or not defined in any iteration (F2008 8.1.6.7p1). If the compiler is to execute the loop concurrently (it is not required to do so) then it must do the necessary analysis to determine that a variable that is defined then referenced in an iteration needs to be the equivalent of OpenMP PRIVATE. – IanH Jul 24 '16 at 09:02
  • Checked without the REDUCTION clause. Not much of a difference. – physkets Jul 24 '16 at 09:02
  • @IanH, all I'm saying is that here, in this case, for them to be the equivalent of PRIVATE, they have to be arrays. I am not making a general statement. If they are not, concurrent operations, here, wouldn't be possible. Do you disagree? – physkets Jul 24 '16 at 09:06
  • I disagree. The language does not require those variables to be arrays. Read the semantics of DO CONCURRENT in the Fortran 2008 standard, plus corrigenda, carefully. DO CONCURRENT has been somewhat unfortunately named - it does not actually mean or require that iterations are to be done concurrently. – IanH Jul 24 '16 at 09:16
  • @IanH okay, I read it and now I understand. So all it means is that the processor is allowed to perform the iterations in arbitrary order, which may or may not be concurrent. It really is a case of bad naming. (A sketch of a conforming scalar version follows these comments.) – physkets Jul 24 '16 at 10:02
  • Just because we have a shiny new feature: some [documentation](http://stackoverflow.com/documentation/fortran/1657/execution-control/8820/block-do-construct#t=201607241116577395688) of `do concurrent`. – francescalus Jul 24 '16 at 11:18
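As a sketch of the point IanH makes (my illustration, not code from the thread): F2008 allows a scalar to be referenced in an iteration if it was defined earlier in that same iteration, so x, y and z can stay scalars; a compiler that does run iterations concurrently must then treat them as the equivalent of OpenMP PRIVATE, and they become undefined once the construct finishes.

DO CONCURRENT (i = 1:npart)
    x = MODULO(i-1, npart_edge)   ! scalar, defined before any reference in this iteration
    Rx(i) = space*x
    y = MODULO( ( (i-1) / npart_edge ), npart_edge)
    Ry(i) = space*y
    z = (i-1) / npart_face
    Rz(i) = space*z
END DO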

1 Answer


DO CONCURRENT does not do any parallelization per se. The compiler may decide to parallelize it using threads, use SIMD instructions, or even offload to a GPU. For threads you often have to instruct it to do so. For GPU offloading you need a particular compiler with particular options. Or (often!) the compiler just treats DO CONCURRENT as a regular DO and uses SIMD instructions if it would use them for the regular DO.

OpenMP is also not just threads; the compiler can use SIMD instructions if it wants. There is also the omp simd directive, but that is only a suggestion to the compiler to use SIMD, and it can be ignored.
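For illustration, here is what the directive would look like on one of the loops from the question (my sketch, assuming OpenMP 4.0 or newer; PRIVATE(x) stops the scalar temporary from creating a dependence between lanes):

!$OMP SIMD PRIVATE(x)
DO i = 1, npart
    x = MODULO(i-1, npart_edge)
    Rx(i) = space*x
END DO
!$OMP END SIMD

There is also the combined !$OMP PARALLEL DO SIMD form if you want threads and SIMD at the same time.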

You should try, measure and see. There is no single definitive answer. Not even for a given compiler, much less for all compilers.

If you are not going to use OpenMP anyway, I would give DO CONCURRENT a try to see if the automatic parallelizer does a better job with this construct. Chances are good that it will help. If your code is already in OpenMP, I do not see any point in introducing DO CONCURRENT.

My practice is to use OpenMP and try to make sure the compiler vectorizes (SIMD) what it can, especially because I use OpenMP all over my programs anyway. DO CONCURRENT still has to prove it is actually useful. I am not convinced yet, but some GPU examples look promising; however, real codes are often much more complex.


Your specific examples and the performance measurement:

Too little code is given, and there are subtle points in any benchmarking. I wrote some simple code around your loops and did my own tests. I was careful NOT to include the thread creation in the timed block; you should not include the !$omp parallel directive in your timing. I also took the minimum real time over multiple runs, because sometimes the first run takes longer (certainly with DO CONCURRENT); the CPU has various throttle modes and may need some time to spin up. I also added SCHEDULE(STATIC). A sketch of the kind of harness I mean follows.
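This is a minimal reconstruction of such a harness (my sketch, not the exact code behind the numbers below; the problem size, repeat count and space value are arbitrary placeholders):

program bench_omp
    use omp_lib, only: omp_get_wtime
    implicit none
    integer, parameter :: npart_edge = 100
    integer, parameter :: npart_face = npart_edge**2
    integer, parameter :: npart = npart_edge**3
    integer, parameter :: nrep = 10          ! repeats; the minimum time is reported
    real(8), parameter :: space = 1.0d0      ! placeholder lattice spacing
    real(8), allocatable :: Rx(:), Ry(:), Rz(:)
    real(8) :: t0, t1, tmin, x, y, z
    integer :: i, rep

    allocate (Rx(npart), Ry(npart), Rz(npart))
    tmin = huge(tmin)
    !$omp parallel default(shared) private(x, y, z, rep)  ! thread creation happens here, outside the timing
    do rep = 1, nrep
        !$omp single
        t0 = omp_get_wtime()
        !$omp end single                     ! implicit barrier: clock starts before any thread works
        !$omp do schedule(static)
        do i = 1, npart
            x = modulo(i-1, npart_edge)
            Rx(i) = space*x
            y = modulo((i-1)/npart_edge, npart_edge)
            Ry(i) = space*y
            z = (i-1)/npart_face
            Rz(i) = space*z
        end do
        !$omp end do                         ! implicit barrier: all threads done before the clock stops
        !$omp single
        t1 = omp_get_wtime()
        tmin = min(tmin, t1 - t0)            ! keep the minimum over the repeats
        !$omp end single
    end do
    !$omp end parallel
    print *, 'minimum real time =', tmin, ' check =', Rz(npart)  ! the check keeps the loop from being optimized away
end program bench_omp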

npart=10000000
ifort -O3 concurrent.f90: 6.117300000000000E-002
ifort -O3 concurrent.f90 -parallel: 5.044600000000000E-002
ifort -O3 concurrent_omp.f90: 2.419600000000000E-002

npart=10000, default 8 threads (hyper-threading)
ifort -O3 concurrent.f90: 5.430000000000000E-004
ifort -O3 concurrent.f90 -parallel: 8.899999999999999E-005
ifort -O3 concurrent_omp.f90: 1.890000000000000E-004

npart=10000, OMP_NUM_THREADS=4 (ignore hyper-threading)
ifort -O3 concurrent.f90: 5.410000000000000E-004
ifort -O3 concurrent.f90 -parallel: 9.200000000000000E-005
ifort -O3 concurrent_omp.f90: 1.070000000000000E-004

Here, DO CONCURRENT seems to be somewhat faster for the small case, but not too much if we make sure to use the right number of cores. It is clearly slower for the big case. The -parallel option is clearly necessary for the automatic parallelization.

  • I've added the info you requested earlier as an edit. Also, I do use openmp elsewhere in my code, but there, concurrency will not help. – physkets Jul 24 '16 at 08:36
  • ifort -qparallel requests parallelization of do concurrent. With the options given above, there would be only simd vectorization, which appears to be more efficient for your problem size. If your inner loop length is at most 100, splitting it into threaded chunks would remove most of the advantage of simd vectorization. BLOCK may be used within DO CONCURRENT to avoid making arrays for local variables (see the sketch after these comments). – tim18 Jul 24 '16 at 12:18
  • Guessing that you run with hyperthreads enabled, you will need to check parallel performance with fewer than default number of threads, setting OMP_PLACES=cores, before you conclude that parallel is too inefficient. There is little reason to expect do concurrent auto-parallel to perform different from OpenMP. – tim18 Jul 24 '16 at 12:26
  • @tim18 Better comment under the question. This answer was written before the details were revealed and does not reflect them at all. – Vladimir F Героям слава Jul 24 '16 at 12:31
  • @tim18 So if I put a `BLOCK` inside the `DO` loop, the local vars of each iteration are independent? Is that more efficient than making an array? Also, why should I set omp to not use hyperthreads? – physkets Jul 24 '16 at 19:49
  • @physkets It would be better to discuss it under the question and not this answer. Declaring it in a block will have the same effect as declaring it at the top of the procedure and then making it `private`. With an array you risk false sharing https://en.wikipedia.org/wiki/False_sharing – Vladimir F Героям слава Jul 24 '16 at 20:35
  • @physkets And why not hyperthreads? Because it often slows down the code. If you have 4 cores, making it use 8 hyperthreads does not magically bring new cores. It is a complex topic. Please do not discuss it here. Ask a full question if you want. If I delete my answer, all these comments will be gone. – Vladimir F Героям слава Jul 24 '16 at 20:41
  • @VladimirF Can a mod copy and paste these comments up there? – physkets Jul 25 '16 at 04:57
  • Nope, no one can. Comments are not meant to carry important information and survive here. – Vladimir F Героям слава Jul 25 '16 at 07:26
  • Is do concurrent useful in 2021, compared to simple do loop? Does it serve any performance benefits? – Eular Oct 04 '21 at 10:11
  • @Eular If you use automatic parallelization, it makes it simpler for the compiler. I am sure you will find cases where it is necessary for the automatic parallelizer. Whether it is better than other parallelization approaches is a matter for debate, but some people are very enthusiastic about it and its automatic parallelizability and offloadability. A full discussion is better done elsewhere. – Vladimir F Героям слава Oct 04 '21 at 12:51
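As a sketch of tim18's BLOCK suggestion (my illustration; the declared type of x, y and z is an assumption, since the question never shows their declarations): each iteration gets fresh block-local variables, so no helper arrays are needed and there is no risk of false sharing.

DO CONCURRENT (i = 1:npart)
    BLOCK
        REAL :: x, y, z           ! block-local: a fresh instance in every iteration (type assumed)
        x = MODULO(i-1, npart_edge)
        Rx(i) = space*x
        y = MODULO( ( (i-1) / npart_edge ), npart_edge)
        Ry(i) = space*y
        z = (i-1) / npart_face
        Rz(i) = space*z
    END BLOCK
END DO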