3

There is an MPI version of a program that uses COMMON blocks to store arrays which are used everywhere throughout the code. Unfortunately, there is no way to declare arrays in a COMMON block whose size would be known only at run time. So, as a workaround, I decided to move those arrays into modules, which do accept ALLOCATABLE arrays. That is, all arrays in COMMON blocks were removed and ALLOCATE was used instead. This was the only thing I changed in my program. Unfortunately, the performance of the program became awful compared to the COMMON-block version.

As to the MPI settings, there is a single MPI process on each computational node, and each MPI process has a single thread. I found a similar question asked here, but I don't see (don't understand :) ) how it could apply to my case (where each process has a single thread). I appreciate any help.

Here is a simple example which illustrates what I was talking about (below is pseudocode):

"SOURCE FILE":

SUBROUTINE ZEROSET()
   INCLUDE 'FILE_1.INC'
   INCLUDE 'FILE_2.INC'
   INCLUDE 'FILE_3.INC'
   ....
   INCLUDE 'FILE_N.INC'

   ARRAY_1 = 0.0
   ARRAY_2 = 0.0
   ARRAY_3 = 0.0
   ARRAY_4 = 0.0
   ...
   ARRAY_N = 0.0
END SUBROUTINE

As you can see, ZEROSET() has no parallel or MPI stuff. FILE_1.INC, FILE_2.INC, ..., FILE_N.INC are files where ARRAY_1, ARRAY_2, ..., ARRAY_N are defined in COMMON blocks, something like this:

REAL ARRAY_1
COMMON /ARRAY_1/ ARRAY_1(NX, NY, NZ)

where NX, NY, NZ are well-defined constants declared with the PARAMETER statement. When I switched to modules, I simply removed all the COMMON blocks, so the module counterpart of FILE_I.INC looks like

REAL, ALLOCATABLE:: ARRAY_I(:,:,:)

And then I just changed the "INCLUDE 'FILE_I.INC'" statements above to "USE FILE_I". Actually, when the parallel program is executed, a particular process does not need the whole (NX, NY, NZ) domain, so I compute the local dimensions and then allocate ARRAY_I (only ONCE!).
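To make this concrete, a minimal sketch of one such module and the one-time allocation could look like this (the routine name SETUP_ARRAY_I and the local-size arguments are placeholders, not the actual names from my code):

MODULE FILE_I
   REAL, ALLOCATABLE :: ARRAY_I(:,:,:)
END MODULE FILE_I

SUBROUTINE SETUP_ARRAY_I(NX_LOC, NY_LOC, NZ_LOC)
   USE FILE_I
   INTEGER NX_LOC, NY_LOC, NZ_LOC
   ! Local sub-domain sizes are computed at run time; the array is allocated only once
   IF (.NOT. ALLOCATED(ARRAY_I)) ALLOCATE(ARRAY_I(NX_LOC, NY_LOC, NZ_LOC))
END SUBROUTINE SETUP_ARRAY_I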

Subroutine ZEROSET() takes 0.18 seconds with COMMON blocks and 0.36 seconds with modules (when the array dimensions are computed at run time). So the performance worsened by a factor of two.

I hope that everything is clear now. I appreciate your help very much.

TruLa
  • How should this be related to the parallel execution? Is the performance drop not already visible in a serial execution? How often do you allocate? Do the dummy arguments have the allocatable attribute? – haraldkl Sep 09 '11 at 11:48
  • It is not likely your problem is MPI-related. Are you using shared or distributed memory? You might be hitting a memory bottleneck, but if you did everything right, that is not the case, since the COMMON-block version of the code works fine. Please include sample code of how and what you allocate. If you allocate and deallocate your arrays once (at the start and end of the program, respectively), you should not see a decrease in performance. – milancurcic Sep 09 '11 at 14:16
  • Saying the performance was 'awful' is not very quantitative: how much slower? As said above, please verify your problem with a serial run first. – steabert Sep 09 '11 at 18:56
  • Guys, I am pretty sure that the problem has nothing to do with parallel programming or MPI. That is why my question's title does not contain MPI-related words. I was just trying to be precise and describe the program's environment as well. Here is a simple example which illustrates what I was talking about (sorry for not providing it earlier). P.S. I cannot find how to show a piece of code in a comment, so I describe the example in my initial post. – TruLa Sep 10 '11 at 11:49

3 Answers

3

Using allocatable arrays in modules can often hurt performance because the compiler has no idea about their sizes at compile time. With many compilers you will get much better performance from this code:

   subroutine X
   use Y            ! module Y holds the allocatable array A(N,N) and the size N
   call Z(A, N)     ! pass the array to a worker as an explicit-shape argument
   end subroutine X

   subroutine Z(A, N)
   integer N
   real A(N, N)
   ! ... do stuff here on A, which Z sees as a plain N-by-N array ...
   end subroutine Z

than from this code:

   subroutine X
   use Y            ! module Y holds the allocatable array A(N,N)
   ! ... do stuff here directly on the module's allocatable A ...
   end subroutine X

In the first version, inside Z the compiler knows that the array is NxN and that the DO loops run over N, and it can take advantage of that fact (most codes work on arrays that way). In the second version, after any subroutine call inside "do stuff here", the compiler has to assume that array "A" might have changed size or moved in memory, and has to recheck. That kills optimization.

This should get you most of your performance back.

Common blocks are also located in a fixed place in memory, and that allows further optimizations.
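A self-contained version of the first pattern above might look like the following sketch (the module name, the size, and the zero-setting body are made up for illustration):

   module y
      implicit none
      integer :: n
      real, allocatable :: a(:, :)
   end module y

   subroutine x()
      use y
      if (.not. allocated(a)) then
         n = 512                       ! size known only at run time
         allocate(a(n, n))
      end if
      call z(a, n)                     ! pass as an explicit-shape argument
   end subroutine x

   subroutine z(a, n)
      implicit none
      integer, intent(in) :: n
      real, intent(inout) :: a(n, n)   ! inside z the bounds are fixed for the whole call
      a = 0.0                          ! e.g. the zero-setting from the question
   end subroutine z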

0

I can think of just these reasons when it comes to Fortran performance with arrays:

  1. arrays on the stack vs. the heap, but I doubt this could have a huge performance impact;
  2. passing arrays to a subroutine, because the best way to do that depends on the array; see this page on using arrays efficiently (a small sketch follows below).
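For point 2, a small sketch of the kind of difference such articles describe (the program and routine names here are made up): passing a strided array section to an explicit-shape dummy typically forces the compiler to create a temporary copy, while an assumed-shape dummy receives the section through its descriptor.

   program pass_demo
      implicit none
      real :: a(100, 100)
      a = 1.0
      call explicit_shape(a(1:100:2, 1), 50)   ! strided section: a temporary copy is likely
      call assumed_shape(a(1:100:2, 1))        ! descriptor is passed, no copy needed
   contains
      subroutine explicit_shape(x, n)
         integer, intent(in) :: n
         real, intent(inout) :: x(n)           ! explicit-shape dummy expects contiguous storage
         x = x + 1.0
      end subroutine explicit_shape

      subroutine assumed_shape(x)
         real, intent(inout) :: x(:)           ! assumed-shape dummy carries the stride
         x = x + 1.0
      end subroutine assumed_shape
   end program pass_demo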
steabert
  • The link is broken, so I submitted an edit with what I believe to be the same article. Please fix it if I am wrong. – astay13 Dec 12 '12 at 19:41
0

Actually, I guess your problem here is indeed compiler-optimization based, in combination with stack vs. heap memory. Depending on the compiler you are using, it might do some more efficient memory blanking, and for a fixed chunk of memory it does not even need to check its extent and location within the subroutine, so for the fixed-size arrays there is close to no overhead involved. Is this routine called very often, or why do you care about these 0.18 s? If it is indeed relevant, the best option would be to get rid of the zero setting altogether and instead, for example, peel off the first iteration of the loop and use it for the initialization; that way you do not introduce additional memory accesses just to initialize with zero. However, it would duplicate some code...
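For example, a minimal sketch of that idea (the loop bounds and the contribution function are made up; they only stand in for whatever the real time loop accumulates into the array):

   program peel_first_iteration
      implicit none
      integer, parameter :: nx = 64, ny = 64, nz = 64, nt = 100
      real :: array_1(nx, ny, nz)
      integer :: it

      ! The first iteration doubles as the initialization, so there is no
      ! separate pass over the memory just to blank it with zeros.
      array_1 = contribution(1)
      do it = 2, nt
         array_1 = array_1 + contribution(it)
      end do

   contains
      real function contribution(it)      ! hypothetical per-step contribution
         integer, intent(in) :: it
         contribution = real(it)
      end function contribution
   end program peel_first_iteration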

haraldkl