
I have a Fortran90 program (Packmol) which, until now, was implemented with static memory allocation.

I changed the code to use dynamic allocation, in such a way that all arrays are allocated at the beginning. I had a performance loss of 400% in some examples.

Then, I verified that even if the sizes of the arrays are the same as those of the static version, the problem persists. That is, changing the declarations from `double precision :: x(1000)` to something like

    double precision, allocatable :: x(:)
    allocate(x(1000))

is enough to cause the performance loss, when that is done for all arrays that need to be allocated dynamically, which are about 30.
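A minimal sketch of the kind of change described above (the array name and size are illustrative, not the real Packmol arrays):

```fortran
program alloc_example
   implicit none
   ! static version was:   double precision :: x(1000)
   ! dynamic version: same size, allocated once at startup
   double precision, allocatable :: x(:)
   allocate(x(1000))
   x = 0.d0
   print *, sum(x)
end program alloc_example
```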

Is there a way to allocate the arrays more efficiently, to reduce the performance penalty? Or does someone have a different suggestion?

Thank you very much.

EDIT: Somehow the problem was solved. The dynamic version is now only slightly slower than the static version, which is expected. I do not know, really, what was causing the major slowdown before.

leandro
  • Well, I didn't put the code because it is a large program. The code is available here: https://github.com/leandromartinez98/packmol. There is a branch in which all the arrays are allocated dynamically, in the subroutine setsizes.f90. – leandro Mar 09 '16 at 23:42
  • Is that routine called often? Do allocated sizes change between calls? (Or are they set for the whole run of the program.) The cost of `allocate`-ing itself isn't that great at all. It depends on how it is used. – zdim Mar 09 '16 at 23:45
  • As you can see, it is not very likely that someone will go through such a link. Even the present answer is just guessing. You should prepare a minimal example reproducing the problem. Find the subroutine where the difference is large and play with it. – Vladimir F Героям слава Mar 09 '16 at 23:49
  • I looked, and as you say it is not small. I cannot pull all that in and analyze how it works to discover how many times it is called, whether sizes can feasibly be changed at run time, etc. – zdim Mar 09 '16 at 23:55
  • Profile your code - before and after the change to allocatable arrays. Make sure that you are testing a build that has a decent level of optimisation enabled and that does not have runtime debugging options, such as array bounds checks, enabled. – IanH Mar 10 '16 at 00:01
  • I didn't intend to be lazy, nor was I expecting someone to really look at the code. The point is that the two-line example I provided is representative of what I'm doing. That is: all arrays are allocated once, at the very beginning of the code. The problem is NOT the time spent allocating/deallocating arrays. The problem is in the access of these arrays by the routines that do the heavy computation. As I tried to explain, possibly without the needed detail, even if I simply allocate with the same dimensions immediately after start, instead of allocating statically, I get the slowdown. – leandro Mar 10 '16 at 00:24
  • My compiler here complains about argument mismatches in procedure invocations and substring positions exceeding string lengths. – IanH Mar 10 '16 at 00:44
  • OK, that helps. I didn't get it from the question itself. So these allocations happen literally once in the whole program, right at the very beginning. That is interesting. – zdim Mar 10 '16 at 01:08
  • I am getting drawn to this, I find it interesting. Questions: 1) Why do you need dynamic allocation, if it happens once at the beginning? Are the sizes set right before, from the command line or other parameters not known beforehand (files etc.)? 2) Do you have an idea of what sizes are likely to come up, or a range for that? 3) How recent are your compilers? Are (some) things from the 2008 standard OK to use? – zdim Mar 10 '16 at 04:38
  • Note that the code is non-conforming. Until that is fixed there isn't much point pondering this question - the variation may simply be due to a different execution path selected as a consequence of memory corruption or similar. – IanH Mar 10 '16 at 06:52
  • @IanH, thank you. Which compiler are you using? I have gfortran and ifort. Ifort complains about some write statements, but this is not important. – leandro Mar 10 '16 at 10:30
  • 1) The allocation must be dynamic only because I cannot anticipate the size of the problems, but from the problem data (the "number of atoms") I can set all arrays at the beginning. The sizes of the problems users are working with are increasing, and people are starting to have memory allocation issues, such as segmentation faults at the very start (which, I understand, means that the system didn't allow the static allocation of all required arrays). They have to use the "ulimit -s unlimited" command, for instance. But many of the users are not programmers, so they have problems with that. – leandro Mar 10 '16 at 10:31
  • 2) I have an idea, and that is why I distribute a reasonable static allocation version of the code. But problems are getting bigger and I am getting more and more people complaining about memory issues. I hoped to fix that for good. – leandro Mar 10 '16 at 10:32
  • 3) I am using the latest GNU and Intel fortran compilers. My only concern with the compiler versions is that I would not like to provide a code that could present compilation problems from the user side. I distribute the source code, not the executables. – leandro Mar 10 '16 at 10:34
  • Alright, that helps. Optimization of loops may be hurt badly if the compiler has no clue about array sizes, and you do say that arrays are heavily used, so that may come to tell. On the other hand, if memory is already being pushed then that's a little different ball game and performance can certainly be affected. It does seem that it would be good to know more. – zdim Mar 10 '16 at 10:50
  • @zdim: As soon as I can (not this week), I will try to write a version of the code as simple as possible, in which the slowdown is observed. When I do that I will post it, and perhaps learn something about what is going on. – leandro Mar 10 '16 at 12:00
  • I used ifort current release, with `/check:all /warn:all`, compiling the entire sequence of source files twice, with no clean in between, to allow complete interface checking. There's a type mismatch with one of the work arrays in a call to `easygencan`, then there are several (I gave up after three or so) substring-out-of-bounds issues associated with testing both the value of a substring and the substring index in the one logical expression. Fix those (and any others), then profile the two different memory allocation approaches. – IanH Mar 10 '16 at 19:12
  • @IanH: Sorry about my ignorance. How exactly did you compile the software? That is, how did you add those flags to the compilation procedure? I am not able to reproduce these warnings here. I have ifort 14.0.2. The only warnings I get are "Recommended relationship between field width..." in some write statements, which are of no relevance for the computations. – leandro Mar 11 '16 at 12:44
  • @IanH: I fixed the declaration of the wi() vector, which was wrong. Nothing changed in terms of computer time (the declaration was correct in the static allocation version of the code). I don't get "substring-out-of-bounds" messages, except if they are the same as the "Recommended relationship between field..." messages in write statements. Those last cannot be of any importance. – leandro Mar 11 '16 at 14:40
  • I created [an issue](https://github.com/leandromartinez98/packmol/issues/1) on your github project with some details. Probably best to continue discussion about the specifics of the code there, once you are confident the code is conforming then this question can be further addressed here. – IanH Mar 12 '16 at 00:00

1 Answer


There can be many reasons for such a performance loss:

1) Static arrays are allocated in the BSS or data segment (see Where are static variables stored (in C/C++)?), whereas allocatable arrays are, in practice, allocated on the heap (automatic arrays, whose sizes come from dummy arguments, are typically placed on the stack). Allocation on the stack is much faster than on the heap, though, as noted in the comments below, allocatable arrays essentially always live on the heap.
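For comparison, an automatic array (one whose size comes from a dummy argument) is typically placed on the stack, which is why very large arrays of this kind can require raising the stack limit (a minimal sketch; the names are illustrative):

```fortran
subroutine auto_demo(n)
   implicit none
   integer, intent(in) :: n
   double precision :: tmp(n)   ! automatic array: usually stack-allocated
   tmp = 1.d0
   print *, sum(tmp)
end subroutine auto_demo
```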

2) You may have allocate/deallocate statements in loops. Every memory allocation takes some time. A good compiler (or the underlying allocator) can avoid physically allocating memory on every allocation and instead re-use space that has been deallocated.

3) With static arrays the compiler knows the dimensions at compile time, so it can perform additional optimizations (constant loop bounds, precomputed index arithmetic, better vectorization).
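One common way to recover most of this optimization with allocatable arrays is to allocate in the caller and pass the arrays to the compute routines as explicit-shape dummy arguments, so the hot loops see contiguous arrays with plain integer bounds (a sketch; the routine and variable names are illustrative, not from Packmol):

```fortran
subroutine heavy_work(n, x, y)
   implicit none
   integer, intent(in) :: n
   double precision, intent(in)  :: x(n)   ! explicit shape: no array descriptor in the loop
   double precision, intent(out) :: y(n)
   integer :: i
   do i = 1, n
      y(i) = 2.d0 * x(i)
   end do
end subroutine heavy_work
```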

4) If you have multi-dimensional arrays, the address calculation for an element can't be fully done at compile time. For example, with A(n1,n2,n3), the offset of element A(i,j,k) is (i-1) + (j-1)*n1 + (k-1)*n1*n2 (in elements). For static arrays, the compiler can optimize this arithmetic. Moreover, if a dimension n1, n2, ... is a power of 2, the compiler can replace the integer multiplications with bit shifts, which are faster.

Number 3) is the most probable. You can keep static allocation for arrays for which you know a reasonable upper bound at compile time, which are relatively small (roughly fewer than 1000 elements), and which are used inside routines that are called very often and do a very small amount of work.

As a rule of thumb, only small arrays should be statically allocated: most 1D arrays, some small 2D arrays, and tiny 3D arrays. Convert everything else to dynamic allocation, as large arrays will probably not fit within the default memory limits otherwise.

If you have some frequent allocate/deallocates because you call a subroutine in a loop such as this:

 do i = 1, 10000000
    call work(a, b)
 end do

 subroutine work(a, b)
   ...
   allocate(c(n))    ! n: whatever size c needs; allocated and freed on every call
   ...
   deallocate(c)
 end subroutine work

if c always has the same dimensions, you can pass it as an argument of the subroutine, or make it a global variable (in a module) that is allocated only once, before calling work:

 use module_where_c_is_defined

 allocate(c(n))
 do i = 1, 10000000
    call work(a, b)
 end do
 deallocate(c)

 subroutine work(a, b)
   use module_where_c_is_defined
   if (.not. allocated(c)) then
      stop 'c is not allocated'
   end if
   ...
 end subroutine work
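The module referenced in the snippets above could be as simple as this (a sketch, reusing the same hypothetical names):

```fortran
module module_where_c_is_defined
   implicit none
   double precision, allocatable :: c(:)   ! shared work array, allocated once by the caller
end module module_where_c_is_defined
```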
Anthony Scemama
  • Do you confuse static and automatic arrays? Static arrays are in static storage. They need a longer lifetime than the stack (with the exception of the main function's stack) provides. Often they are directly present in the executable binary, especially when they are initialized. – Vladimir F Героям слава Mar 09 '16 at 23:45
  • Thank you very much for your detailed answer. Actually I don't deallocate any vector at any moment in the program. All vectors are allocated at the beginning, in the main program, and never changed again, they are shared through modules or passed as arguments to subroutines in which they are declared with fixed dimensions. The compiler optimizations are, of course, one issue. I didn't think that that would be that critical in my code. Using ifort I get 100% slowdown when using dynamic memory allocation. I will try to allocate dynamically only the most critical arrays. – leandro Mar 09 '16 at 23:46
  • Static arrays, given the usual meaning of that term, are not typically put on the stack - they typically are put into a data or BSS segment. Perhaps you were thinking of automatic arrays, though these may be stored on the stack or heap, depending on the implementation. I am not aware of any implementations of allocatable arrays that store the array data (versus the descriptor for the array) on the stack - they all use the heap (I can only think of very limited circumstances where stack based storage of allocatables would even be possible). – IanH Mar 09 '16 at 23:46
  • I am curious about something very specific: does the order in which the arrays are allocated make any difference? Might allocating many arrays in a single statement be different from allocating them in multiple statements? Is there any way to optimize the position of the vectors in memory at the moment they are allocated? Is it possible to mimic static memory allocation when all dimensions are known at the start of execution? – leandro Mar 09 '16 at 23:49
  • @leandro These questions can easily deserve a separate question post (maybe each of them). – Vladimir F Героям слава Mar 09 '16 at 23:52
  • Dear Anthony, I changed the allocation in order to allocate dynamically only the arrays dependent on the most critical dimensions. It was a good suggestion, but unfortunately, in my case, it didn't make much of a difference. The code went from running in 64 seconds to running in 60 seconds, while with completely static allocation I get 18 seconds (in a small test case). – leandro Mar 10 '16 at 00:26
  • @leandro I would think that allocating all at once is (far?) _more_ efficient. It should be better doing it as early as possible, too. I don't see any way that you can get such a downgrade in performance that is not fixable. This has to be fixable. I would imagine that you may see minimal effect (if any), if everything is done right. – zdim Mar 10 '16 at 03:58
  • @VladimirF: I used `static` as it is used in C (for DATA and BSS). My mistake! I thought BSS segments were allocated at the bottom of the stack, but they have their own segment (http://stackoverflow.com/questions/93039/where-are-static-variables-stored-in-c-c). I will edit the answer right now. – Anthony Scemama Mar 10 '16 at 08:28
  • @zdim. I agree with you. I didn't expect to get that slowdown from doing that. That is why I am trying to figure out the problem, and I came to write a version of the code that simply allocates the arrays just after declaration, with the same sizes they had in the static version. I am trying to understand where the problem is, and I will probably learn important stuff about computing in the meanwhile. :-) – leandro Mar 10 '16 at 10:38