
As part of my Ph.D. research, I am working on the development of numerical models of atmosphere and ocean circulation. These involve numerically solving systems of PDEs on the order of 10^6 grid points, over 10^4 time steps. Thus, a typical model simulation takes hours to a few days to complete when run with MPI on dozens of CPUs. Naturally, improving model efficiency as much as possible is important, while making sure the results stay bit-for-bit identical.

While I feel quite comfortable with my Fortran programming and am aware of quite a few tricks to make code more efficient, I feel there is still room to improve, and tricks that I am not aware of.

Currently, I make sure to use as few divisions as possible, to avoid literal constants in actual computations (I was taught this very early on, e.g. to write half = 0.5 once and use half instead of 0.5 inline), to use as few transcendental functions as possible, etc.
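For concreteness, a minimal sketch of the named-constant style I mean (the module and names are just illustrative):

! Named constants gathered in one module; a PARAMETER is folded in at
! compile time, so this is about consistency rather than raw speed.
module constants
  implicit none
  real, parameter :: half = 0.5
end module constants

! Usage, e.g. kinetic energy per unit mass:
!   use constants
!   ke = half * (u*u + v*v)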

What other performance-sensitive factors are there? At the moment, I am wondering about a few:

1) Does the order of mathematical operations matter? For example, if I have:

a=1E-7 ; b=2E4 ; c=3E13
d=a*b*c

would d evaluate with different efficiency based on the order of multiplication? Nowadays this must be compiler-specific, but is there a straight answer? I notice d getting a (slightly) different value based on the order (precision limit), but does this impact efficiency or not?
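To illustrate the value difference (not the speed), the order can be forced with parentheses:

d1 = (a*b)*c   ! a*b = 2E-3 is formed first, then multiplied by 3E13
d2 = a*(b*c)   ! b*c = 6E17 is formed first, then multiplied by 1E-7
! Both results are ~6E10, but the intermediate rounding can differ in the
! last bits, which is why d changes slightly with the order.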

2) Passing lots (e.g. dozens) of arrays as arguments to a subroutine versus accessing these arrays from a module within the subroutine?
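To be concrete, here is a minimal sketch of the two styles (subroutine and array names are hypothetical):

! Style 1: arrays passed explicitly as arguments
subroutine step_args(n, u, v)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: u(n)
  real, intent(in) :: v(n)
  u = u + v   ! stand-in for the actual update
end subroutine step_args

! Style 2: the same arrays reached through a module
module fields
  implicit none
  real, allocatable :: u(:), v(:)
end module fields

subroutine step_module()
  use fields
  implicit none   ! u and v come from the module; allocated elsewhere
  u = u + v
end subroutine step_module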

3) Fortran 95 constructs (FORALL and WHERE) versus DO and IF? I know these mattered back in the '90s when code vectorization was a big thing, but is there any difference now that modern compilers are able to vectorize explicit DO loops? (I am using the PGI, Intel, and IBM compilers in my work.)
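As an example of the kind of equivalent pair I have in mind (names and values are only illustrative):

program mask_demo
  implicit none
  integer, parameter :: n = 5
  real, parameter :: eps = 1.0e-6
  real :: h(n), flux(n), u(n)
  integer :: i

  h = (/ 0.0, 1.0, 2.0, 0.0, 4.0 /)
  flux = 1.0

  ! Fortran 95 masked assignment; the division is only evaluated
  ! where the mask is true
  where (h > eps)
    u = flux / h
  elsewhere
    u = 0.0
  end where

  ! Equivalent explicit DO/IF version
  do i = 1, n
    if (h(i) > eps) then
      u(i) = flux(i) / h(i)
    else
      u(i) = 0.0
    end if
  end do
end program mask_demo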

4) Raising a number to an integer power versus multiplication? E.g.:

b=a**4

or

b=a*a*a*a

I have been taught to always use the latter where possible. Does this affect efficiency and/or precision? (This is probably compiler-dependent as well.)

Please discuss and/or add any tricks and tips that you know about for improving Fortran code efficiency. What else is out there? If you know anything specific about what each of the compilers above does with regard to this question, please include that as well.

Added: Note that I do not have any bottlenecks or performance issues per se. I am asking whether there are any general rules for optimizing code at the level of individual operations.

Thanks!

milancurcic
  • It's impossible to give a laundry list of all the things that could be slow in your code. As others have mentioned, profile it. Moreover, are you writing your own PDE solver? Unless that's part of your research, it's better to get one that's already as tricked out as can be. The state of modern compilers, profilers, numerical libraries, and tricks with memory are such that it's better to be familiar with these and to focus on specific questions about what is still slow, than to just look for a laundry list. – Iterator Oct 15 '11 at 22:03
  • I see this all the time. People asking "will the optimizer do this ..." when what you see in the code is `call dgemm('n','n', ...)`. It is just *assumed* that that bugger must be as optimal as possible when actually, for reasonable-size matrices, it spends most of its time calling a function to classify those character flags. – Mike Dunlavey Oct 15 '11 at 22:27
  • @Iterator - Personally, I wouldn't (except in some very special commercial cases; a Ph.D. is not one of them) even go down the path of optimizing today's numerical libraries. One's time could be put to better use, and the time saved in the running program would be marginal. Programmer time is several times more expensive than machine time. – Rook Oct 15 '11 at 22:58
  • @Idigas: We're in agreement. Though, the truth is it depends on how many machines you're using. :) – Iterator Oct 16 '11 at 00:14

3 Answers


Sorry, but all the tricks you mentioned are simply... ridiculous. More precisely, they have no meaning in practice. For instance:

  • what could be the advantage of using half(=0.5) instead of 0.5?
  • idem for computing a**4 versus a*a*a*a ((a*a)**2 would be another possibility too). My personal taste is a**4, because a good compiler will choose the best way automatically.

For **, the only point that could matter is the difference between a ** 4 and a ** 4., the latter being much more CPU-time consuming. But even this point makes no sense without a measurement in an actual simulation.
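A sketch of that distinction (typical compiler behaviour, not something the standard guarantees):

a = 1.5
b = a ** 4    ! integer exponent: typically expanded into multiplications
c = a ** 4.   ! real exponent: typically evaluated via EXP and LOG, much slower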

In fact, your approach is wrong. Develop your code as well as possible. After that, measure objectively the cost of the different parts of your code. Optimizing without measuring first is simply nonsense.

If a part accounts for a high percentage of the CPU time, 50% for instance, don't forget that optimizing that part alone cannot reduce the cost of the overall code by more than a factor of two: even making it infinitely fast still leaves the other 50%. In any case, start the optimization work with the most expensive part (the bottleneck).

Don't forget also that the main improvements generally come from better algorithms.

Francois Jacq
  • Thanks for the answer. I agree about bottlenecks and improving algorithms; however, that is a different issue, and we are happy with where we are on it. The specific code is almost ready for its first public release, but my question is more about low-level operations and compiler/processor implementations. – milancurcic Oct 15 '11 at 19:30

I second the advice that the tricks you have been taught are silly in this era. Compilers do this for you now; such micro-optimizations are unlikely to make a significant difference and may not be portable. Write clear and understandable code, and select your algorithm carefully.

One thing that can make a difference is using the indices of multi-dimensional arrays in the correct order: Fortran stores arrays in column-major order, so recasting an M x N array as N x M can help, depending on your program's pattern of data access. After that, if your program is too slow, measure where the CPU time is consumed and improve only those parts. Experience shows that guessing is frequently wrong and leads to more opaque code for no reason. If you make a code section in which your program spends 1% of its time twice as fast, it won't make any noticeable difference.
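For instance, a minimal sketch of loop ordering that matches the column-major storage (array and bound names are just illustrative):

program order_demo
  implicit none
  integer, parameter :: m = 100, n = 100
  real :: a(m, n)
  integer :: i, j

  ! The leftmost index varies fastest in the inner loop, so successive
  ! iterations touch contiguous memory (cache-friendly in Fortran).
  do j = 1, n
    do i = 1, m
      a(i, j) = real(i + j)
    end do
  end do

  print *, a(m, n)
end program order_demo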

Here are previous answers on FORALL and WHERE: How can I ensure that my Fortran FORALL construct is being parallelized? and Do Fortran 95 constructs such as WHERE, FORALL and SPREAD generally result in faster parallel code?

M. S. B.
  • I don't remember some of these things being important even in the '70s. I agree with the answer, though; one should never guess where optimization is needed. – Rook Oct 15 '11 at 22:59

You've got a priori ideas about what to do, and some of them might actually help, but the biggest payoff is in a posteriori analysis.
(Added: In other words, getting a*b*c into a different order might save a couple of cycles (which I doubt), while at the same time you don't know that you're not getting blind-sided by something spending 1000 cycles for no good reason.)

No matter how carefully you code it, there will be opportunities for speedup that you didn't foresee. Here's how I find them. (Some people consider this method controversial.)

It's best to start with optimization flags OFF when you do this, so the code isn't all scrambled. Later you can turn them on and let the compiler do its thing.

Get it running under a debugger with enough of a workload that it runs for a reasonable length of time. While it's running, interrupt it manually and take a good hard look at what it's doing and why. Do this several times, say 10, so you don't draw erroneous conclusions about what it's spending time on.
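For example, with gdb (just one way to do it; any debugger that can interrupt the program and print a stack trace will do):

$ gdb ./model
(gdb) run
  ... let it run a while, then press Ctrl-C ...
(gdb) backtrace
  ... read the whole stack: where is it, and why is it there? ...
(gdb) continue
  ... repeat the interrupt ~10 times ...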

Here are examples of things you might find:

  • It could be spending a large fraction of time calling math library functions unnecessarily, due to the way some expressions were coded, or with the same argument values as in prior calls (see the sketch after this list).
  • It could be spending a large fraction of time doing some file I/O, or opening/closing a file, deep inside some routine that seemed harmless to call.
  • It could be in a general-purpose library function, calling a subordinate subroutine, for the purpose of checking argument flags to the upper function. In such a case, much of that time might be eliminated by writing a special-purpose function and calling that instead.
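As a sketch of that first case, hoisting a repeated math-library call out of a loop (the subroutine and its names are hypothetical):

subroutine relax(n, dt, tau, v, u)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: dt, tau, v(n)
  real, intent(out) :: u(n)
  real :: decay
  integer :: i

  ! Wasteful version: the same EXP would be evaluated on every iteration
  !   do i = 1, n
  !     u(i) = v(i) * exp(-dt / tau)
  !   end do

  ! Hoisted version: evaluate the loop-invariant call once
  decay = exp(-dt / tau)
  do i = 1, n
    u(i) = v(i) * decay
  end do
end subroutine relax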

If you do this entire operation two or three times, you will have removed the stupid stuff that finds its way into any software when it's first written. After that, you can turn on the optimization, parallelism, or whatever, and be confident no time is being spent on silly stuff.

Mike Dunlavey