
Our parallel Fortran program runs more than twice as slowly since I updated the OS to Ubuntu 14.04 and rebuilt it with gfortran 4.8.2. Measuring which parts of the code were slowed down is unfortunately no longer possible (not without downgrading the OS), since I did not save any profiling information for gprof when compiling under the old OS.

Because the program does lots of matrix inversions, my guess is that a library (LAPACK?) or a programming interface (OpenMP?) was updated between Ubuntu 12 and 14 in a way that slows everything down. I believe this is a general problem that may already be known to somebody here. What is the solution to get back to a fast Fortran code, besides downgrading to Ubuntu 12 or 13?

All libraries were installed from the repositories with apt-get, and thus they should also have been upgraded when I upgraded the system with apt-get dist-upgrade. I could, however, check whether they are indeed the latest versions and/or build them from scratch.

I followed steabert's advice and profiled the present code: I recompiled with gfortran -pg and checked the performance with gprof. The program was suspiciously slow when calling some old F77 subroutines, which I translated to F90 without any performance improvement. I also played with the suggested flags and compared the time of one program iteration: the flags -fno-aggressive-loop-optimizations, -llapack and -lblas did not yield any significant improvement, while -latlas, -llapack_latlas and -lf77blas did not even link (/usr/bin/ld: cannot find -lf77blas, etc.), even though the libraries exist and are in the right path. Both the experiments with flags and the performance analysis suggest that my first guess (the slowdown being related to matrix inversions, LAPACK, etc.) was wrong; it rather seems that the slowdown is in a part of the code where no heavy linear algebra is performed. Using objdump my_exec -s I found out that my program was originally compiled with gfortran 4.6.3 before the OS upgrade, not with the present gfortran (4.8.2). I could now try to compile the code with the old compiler.

nukimov
  • Without the code in hand this will be difficult to answer. Did you measure where your bottleneck is and which parts of the code were slowed down? If not, do it now. – Vladimir F Героям слава Nov 04 '14 at 10:06
  • And of course nothing stops you from building your own LAPACK and BLAS. Make sure you use an optimized BLAS library, like ATLAS or OpenBLAS. They may even be present in your repositories. – Vladimir F Героям слава Nov 04 '14 at 10:07
  • Regarding your edits: **Which** libraries do you use? Most importantly, which BLAS implementation do you use? Search for "BLAS" in your package manager, there will be more of them. – Vladimir F Героям слава Nov 04 '14 at 14:11
  • Thanks for your help. aptitude says my BLAS version is 1.2.20110419-7, which I understand is the latest version. Upgrading ATLAS and LAPACK did not help. I'm also checking the performance of the code right now; it seems particularly slow when calling F77 routines, so perhaps the whole issue is related to the F77 library... – nukimov Nov 04 '14 at 14:50
  • BLAS and ATLAS are 2 implementations of the same library. It is important for you to know, which one of them is enabled as the primary one. Which library do you link? Try to exchange `-lblas` with `-latlas`. – Vladimir F Героям слава Nov 04 '14 at 14:56
  • There are some aggressive optimizations that are enabled as of version 4.8, so you can try to use the `-fno-aggressive-loop-optimizations` flag to compile. This was the biggest impact for us with moving to version 4.8, so I thought I would just mention that, even though it might have nothing to do with your problem. – steabert Nov 04 '14 at 15:58
  • Also, even if you can't determine what was slowed down, you can still determine what takes most time in the newly compiled code. – steabert Nov 04 '14 at 16:00
  • I would first ask if you are using MPI and have the wrong number of processors. And, as @steabert said, you can still determine what takes the most time. If you're using *gprof*, it's not likely to tell you much. [*This explains the method I use.*](http://scicomp.stackexchange.com/a/2719) In packages like BLAS and LAPACK, a small number of operations on large matrices can be fast, but a large number of operations on small matrices can spend most of their time in avoidable miscellany, like checking arguments. – Mike Dunlavey Nov 07 '14 at 17:24
  • Thanks for the reply, Mike. We are not using MPI, all our CPUs share the same memory. I'll try your random pausing to see if I can find out where the program is getting slow. I guess this is only possible with a debugger program like GDB. – nukimov Nov 08 '14 at 09:42
  • Depending on the time of your profiling test, you could also use valgrind. Compile your program with debugging symbols enabled, then run valgrind with something like `--tool=callgrind --dump-instr=yes`. This can give you detailed information on where in the subroutine(s) most time is spent. – steabert Nov 08 '14 at 13:53

2 Answers


This is probably not a 100% satisfactory answer, but it solved my performance problem, so here it is:

I decided to use GDB (Valgrind did not work for me): I compiled with the flag -g, executed with “gdb myprogramname”, typed “run” at the GDB prompt to execute the program, paused with Ctrl+C, checked what the threads were doing with “info threads” and continued with “continue”. I did this several times, at random, to build a rough statistic of where the program was spending most of its time. This soon confirmed what gprof had found before, i.e., that my program was spending lots of time in the function I had translated to F90. However, now I also found out that the mathematical operation taking a particularly long time inside this function was exponentiation, as suggested by the calls to the C function e_powf.c. My function (the equation of state of sea water) has lots of high-order polynomials with terms like T**3 and T**4. To avoid calling e_powf.c, and to see whether this improved the performance of the code, I changed all terms of the type T**2 to T*T, T**3 to T*T*T, etc. Here is an extract of the function as it was before:

!          RW =     999.842594 + 6.793952E-2*T - 9.095290E-3*T**2 + 1.001685E-4*T**3 - 1.120083E-6*T**4 + 6.536332E-9*T**5

and how it is now:

 RW =     999.842594 + 6.793952E-2*T - 9.095290E-3*T*T + 1.001685E-4*T*T*T - 1.120083E-6*T*T*T*T + 6.536332E-9*T*T*T*T*T

As a result, my program is running twice as fast again (i.e., as it did before I upgraded the OS). While this solves my performance problem, I cannot be 100% sure that it is really related to the OS upgrade or to the compiler change from 4.6.3 to 4.8.2, though the fact that the present performance is similar to the pre-upgrade one strongly suggests it is.

Unfortunately, “locate e_powf” does not yield any results on my system; the function seems to be present only in binary form, without its source code. Judging by occurrences on the internet, e_powf.c belongs to glibc's maths library (e.g. http://koala.cs.pub.ro/lxr/#glibc/sysdeps/ieee754/flt-32/e_powf.c) and does not seem to have been updated lately, so if something changed from Ubuntu 12 to 14, or from gfortran 4.6.3 to 4.8.2, it is probably something subtle in the way this function is used.
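A possibly relevant detail, which I have not verified for gfortran 4.8.2: in Fortran the type of the exponent matters. With an integer exponent the compiler is free to expand the power into multiplications, while with a real exponent it generally has to go through the library power function. A minimal sketch of the distinction:

 ! Sketch only: whether a library call is emitted for each case
 ! depends on the compiler version and flags.
 REAL :: T, A, B
 T = 25.0
 A = T**3     ! integer exponent: may be expanded to T*T*T by the compiler
 B = T**3.0   ! real exponent: typically evaluated via powf (e_powf)
 PRINT *, A, B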

Because I found some discussions on the internet about whether using T*T instead of T**2, etc., brings any performance improvement, and most people seem skeptical about it (for instance: http://computer-programming-forum.com/49-fortran/6b042075d8c77a4b.htm ; or a closed question on Stack Overflow: Tips and tricks on improving Fortran code performance), I double-checked my finding. I can now say I am fairly sure that using explicit products of a variable (and thereby avoiding the call to e_powf.c) is faster than exponentiation, at least with gfortran 4.8.2, whatever the reason.
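For anyone who wants to check this on their own system, here is a minimal, self-contained timing sketch (not taken from my program; the loop count and the temp increment are arbitrary choices) comparing integer exponentiation against explicit products:

 program bench_pow
   implicit none
   integer, parameter :: n = 100000000
   integer :: i
   real :: temp, rw, t0, t1

   ! Version with exponentiation, as in the original equation of state
   temp = 25.0
   rw = 0.0
   call cpu_time(t0)
   do i = 1, n
     rw = rw + 6.536332E-9*temp**5
     temp = temp + 1.0E-9   ! keep the compiler from folding the loop away
   end do
   call cpu_time(t1)
   print *, 'temp**5:       ', t1 - t0, 's, rw =', rw

   ! Version with explicit products
   temp = 25.0
   rw = 0.0
   call cpu_time(t0)
   do i = 1, n
     rw = rw + 6.536332E-9*(temp*temp*temp*temp*temp)
     temp = temp + 1.0E-9
   end do
   call cpu_time(t1)
   print *, 'temp*...*temp: ', t1 - t0, 's, rw =', rw
 end program bench_pow

Compiling this once without optimization (as my original build was) and once with -O2 should show whether, on a given system, the compiler closes the gap at higher optimization levels.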

Thanks a lot for all of you who commented, surely it helped me a lot and I learned plenty!

nukimov
  • This is VERY strange. Compilers should definitely optimize this themselves. Which optimizations did you enable? See http://goo.gl/T5Fm0p – Vladimir F Героям слава Nov 11 '14 at 11:24
  • I compiled like this: `gfortran -v -g -fbounds-check -fopenmp -o ahoi ahoi_load_data.f90 ahoi_globalvars.f90 ahoi_globalvarsOI.f90 ahoi_funcs.f90 woa_funcs.f90 ahoi_readcofile.f90 ahoi_main.f90 ahoi_paralind.f90 woa_basis.f90 ahoi_eta.f90 ahoi_dist.f90 ahoi_inverse.f90 ahoi_opest.f90 ahoi_corrclim.f90 ahoi_freezt.f90 ahoi_mapping.f90 ahoi_chnetcdf.f90 woa_adjust.f90 densitychange.f90 dgradcheck.f90 findsal.f findtemp.f ptemptotemp.f roundoff.f90 salest.f -I/usr/include/ -lnetcdff -lnetcdf` – nukimov Nov 11 '14 at 11:41
  • You do not optimize at all! Read https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html and http://en.wikipedia.org/wiki/Optimizing_compiler immediately! Be sure to try all levels from `-O1` to `-O5`! – Vladimir F Героям слава Nov 11 '14 at 12:01
  • Vladimir, thanks! This is really nice! I checked the five optimization levels, which really improved the speed. However, there were still significant differences with and without exponentiation. I measured the time for one program iteration. "Normal" code: 2:55 min. No exponentiation: 1:03. Normal with -O1: 1:00. No exponentiation with -O1: 30 seconds. Normal with -O2: 56 seconds. No exponentiation with -O2: 25 seconds. Further levels did not show significant changes (I tried up to -O5 with and without exponentiation). It still looks like avoiding exponentiation speeds up the code at the same -O level. – nukimov Nov 11 '14 at 13:28
RW =     999.842594 + 6.793952E-2*T - 9.095290E-3*T*T + 1.001685E-4*T*T*T - 1.120083E-6*T*T*T*T + 6.536332E-9*T*T*T*T*T

If you need further improvement, you can write the above line as:

T2=T*T
T3=T2*T
T4=T3*T
T5=T4*T
RW = 999.842594 + 6.793952E-2*T - 9.095290E-3*T2 + 1.001685E-4*T3 - 1.120083E-6*T4 + 6.536332E-9*T5
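Going one step further, the same polynomial can be evaluated with the standard Horner scheme, which needs only one multiplication and one addition per coefficient and no temporary variables; a sketch of the same expression:

! Horner's scheme: evaluate the polynomial innermost-first
RW = 999.842594 + T*(6.793952E-2 + T*(-9.095290E-3 +     &
     T*(1.001685E-4 + T*(-1.120083E-6 + T*6.536332E-9))))

Whether this beats the T2...T5 variant in practice depends on the compiler; at -O2 all the forms may well compile to very similar code.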
neo