
I have a question about access times for multidimensional arrays versus derived types. I wrote an algorithm which works quite well; however, its main part reaches most of its data through % across several nested types. For example, here is the most expensive loop in my code:

   do c_elem = 1 , no_elems
        call tmp_element%init_element(no_fl)
        tmp_element%Gij(no_fl,1) = e_Gxa(c_elem) 
        tmp_element%Gij(no_fl,2) = e_Gax(c_elem) 
        GAX = tmp_element%Gij(no_fl,1)/GAA
        GXA = tmp_element%Gij(no_fl,2)/GAA
        do k = 1 , no_fl-1
            s1 = this%flNew%Iij(k)
            i1 = this%nodes(s1)%fl_id
            tmp_element%Gij(k,1) = this%elOld(c_elem)%Gij(i1,1) + GAX*this%flNew%Gij(no_fl,k)
            tmp_element%Gij(k,2) = this%elOld(c_elem)%Gij(i1,2) + this%flNew%Gij(k,no_fl)*GXA
        enddo
        call this%elOld(c_elem)%init_element(no_fl)
        this%elOld(c_elem)%Gij = tmp_element%Gij
   enddo

As you can see, most of the accesses go through %, and most of the arrays involved are allocatable, complex or integer. I've read that using derived types can sometimes slow a computation down, so I decided to rewrite my algorithm to work directly on plain arrays, without types. The same loop, now using only arrays, looks like this:

    do c_elem = 1 , no_elems
        nGij1(no_fl,c_elem) = e_Gxa(c_elem)
        nGij2(no_fl,c_elem) = e_Gax(c_elem)
        GAX = e_Gxa(c_elem)/GAA
        GXA = e_Gax(c_elem)/GAA
        do k = 1 , no_fl-1
            i1 = fl_id(k)
            nGij1(k,c_elem) = oGij1(i1,c_elem) + GAX*fl_new_Gij(no_fl,k)
            nGij2(k,c_elem) = oGij2(i1,c_elem) + fl_new_Gij(k,no_fl)*GXA
        enddo
    enddo

I expected the code above to run faster than the version with derived types, so I checked the timings. Here are the CPU times for this part of the code:

  1. 0.1 s for the derived-type version (first code)
  2. 3.2 s for the array version (second code), ~30 times slower

I compile with only the -O2 option, using the Intel ifort compiler. In the codes above no_elems >> no_fl, and most of the arrays and type components are complex. It seems clear that the only difference between the two algorithms is the memory access pattern.
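For reference, a minimal way to time such a section (this is only a sketch, not necessarily exactly how the numbers above were obtained) is the cpu_time intrinsic:

    real(kind=8) :: t_start, t_stop
    call cpu_time(t_start)
    ! ... loop under test goes here ...
    call cpu_time(t_stop)
    print *, 'CPU time of the loop [s]:', t_stop - t_start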

Why is there such a huge difference in execution time between the two cases?

Edit #1

I've tried to make a short example, but I could not reproduce the behaviour of the full code: the times for both cases were almost the same, with the arrays slightly faster.

I can show you the part that is completely strange to me. I have a type:

type element
  complex*16,allocatable :: Gij(:,:) 
  integer    :: no_sites    
  complex*16 :: Gp,Gxa,Gax  
  integer    :: i,j
  contains
  procedure,pass(this) :: free_element
  procedure,pass(this) :: init_element
  procedure,pass(this) :: copy_element
end type element

Then, at some point inside one big loop, I calculate values like this:

  ! derived type way
  this%elOld(c_elem)%Gax = GAX
  this%elOld(c_elem)%Gxa = GXA
  this%elOld(c_elem)%Gp  = this%elOld(c_elem)%Gp + GXA*GAX/GAA
  ! array way to do it
  e_Gax(c_elem) = GAX
  e_Gxa(c_elem) = GXA
  e_Gp (c_elem) = e_Gp (c_elem) + GXA*GAX/GAA

where:

  complex(16),allocatable :: e_Gax(:),e_Gxa(:),e_Gp(:)

and elOld is an array of elements, declared in another type as

  type(element),allocatable :: elOld(:),elNew(:)

Then I have the loop I already showed you, but if I write it this way:

  tmp_element%Gij(no_fl,1) = e_Gxa(c_elem) 
  tmp_element%Gij(no_fl,2) = e_Gax(c_elem) 
  GAX = e_Gxa(c_elem)/GAA
  GXA = e_Gax(c_elem)/GAA

instead of this:

  tmp_element%Gij(no_fl,1) = this%elOld(c_elem)%Gxa
  tmp_element%Gij(no_fl,2) = this%elOld(c_elem)%Gax
  GAX = tmp_element%Gij(no_fl,1)/GAA
  GXA = tmp_element%Gij(no_fl,2)/GAA

the program becomes noticeably slower (here tmp_element is declared as type(element) :: tmp_element). In other words, access to the plain array e_Gxa(:) is slower than access to the complex component Gxa inside the array of types, i.e. this%elOld(:)%Gxa. Changing just a few lines of code visibly changes the execution time. I've also checked with gfortran and the results were the same. Here is the part of the full code I'm talking about: modskminv_fast.f90. Sorry for posting a full module, but I could not reproduce the results when I split the code into small pieces.

K Kolasinski
  • Can you prepare a simplified example we can test? We can't give definite answers from such incomplete snippets of code! We don't even know the types of your variables, or what is a function call and what is an array. – Vladimir F Героям слава Feb 25 '16 at 20:03
  • But notice you are looping over the last index. – Vladimir F Героям слава Feb 25 '16 at 20:04
  • Hi, thanks for the comment. I will try to make an example and put it on my GitHub; I hope I can make it work with gfortran. What do you mean by the last index: in code #1 or #2? I think the iteration order in the second code is fine, I followed a tutorial on the internet. In any case, from what I checked this does not play the most important role here. – K Kolasinski Feb 25 '16 at 20:28
  • That is why I posted only part of the full code, which is about 500 lines in total for one module, used from a main.f90 with an additional 100 lines... – K Kolasinski Feb 25 '16 at 20:59
  • OK, those are all arrays there. I will try to put together a very short example which does just part of the job. – K Kolasinski Feb 25 '16 at 21:27
  • Ah yes, there are some functions too, but I've checked that they do not change timing a lot. – K Kolasinski Feb 25 '16 at 21:38
  • Just from a memory access point of view, your derived-type implementation is probably closer to using a single Gij array indexed as Gij(k,1:2,c_elem) than to the two separate arrays you are using (see the sketch after these comments). No idea whether this is the problem here or not, but it appears to be the largest difference: if the two separate arrays are located unfortunately relative to each other, you might run into caching issues. – haraldkl Feb 28 '16 at 23:20
  • Hi, thanks for the comment, I completely agree with you. However, I wrote a small program that was supposed to simulate part of the algorithm, and the result was the opposite: the arrays were faster... I will try to turn the whole algorithm into a single pure subroutine; I'm curious whether that changes anything. – K Kolasinski Feb 29 '16 at 09:40
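To make the last comment concrete, here is a minimal sketch (not taken from the actual code) of the interleaved layout haraldkl suggests: a single 3-D array nGij(k,1:2,c_elem) replaces the two separate arrays nGij1/nGij2, so that both components of an element stay adjacent in memory:

    ! hypothetical interleaved layout, sketch only
    complex(kind=8), allocatable :: nGij(:,:,:), oGij(:,:,:)
    allocate(nGij(no_fl,2,no_elems), oGij(no_fl,2,no_elems))

    do c_elem = 1 , no_elems
        nGij(no_fl,1,c_elem) = e_Gxa(c_elem)
        nGij(no_fl,2,c_elem) = e_Gax(c_elem)
        GAX = e_Gxa(c_elem)/GAA
        GXA = e_Gax(c_elem)/GAA
        do k = 1 , no_fl-1
            i1 = fl_id(k)
            nGij(k,1,c_elem) = oGij(i1,1,c_elem) + GAX*fl_new_Gij(no_fl,k)
            nGij(k,2,c_elem) = oGij(i1,2,c_elem) + fl_new_Gij(k,no_fl)*GXA
        enddo
    enddo

Whether this helps depends on how the two separate arrays end up placed relative to each other, as the comment points out.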

1 Answer


The fault was on my side. Purely by accident I had declared all the complex arrays as

    complex(16)

instead of

    complex*16    ! or, equivalently here, complex(kind=8)

With ifort (and gfortran) the kind number of a complex type is the kind of its real components, so complex(16) requests quad-precision complex arithmetic, which is emulated in software and is therefore much slower, whereas complex*16 means 16 bytes per value, i.e. ordinary double-precision complex. Sorry for bothering you. Now it works like a charm :)
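A minimal check program (a sketch, assuming the usual ifort/gfortran kind numbering, where the complex kind equals the number of bytes per real component) makes the difference visible:

    program kind_check
        implicit none
        ! compare the three declaration styles from above
        complex(16)     :: zq   ! kind=16 -> quad-precision components (software emulated)
        complex*16      :: zd   ! 16 bytes in total -> double-precision components
        complex(kind=8) :: zk   ! same as complex*16 on these compilers
        print *, kind(zq), storage_size(zq)   ! expected: 16  256
        print *, kind(zd), storage_size(zd)   ! expected:  8  128
        print *, kind(zk), storage_size(zk)   ! expected:  8  128
    end program kind_check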

K Kolasinski