4

I am trying to understand what cache or other optimizations could be done in the source code to get this loop faster. I think it is quite cache friendly but, are there any experts out there that could squeeze a bit more performance tuning this code?

DO K = 1, NZ 
    DO J = 1, NY
        DO I = 1, NX

             SIDEBACK  = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1) + &
                         STEN(I  ,J-1,K-1) + STEN(I  ,J,K-1) + STEN(I  ,J+1,K-1) + &
                         STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1) 

             SIDEOWN   = STEN(I-1,J-1,K)   + STEN(I-1,J,K)   + STEN(I-1,J+1,K) + &
                         STEN(I  ,J-1,K)   + STEN(I  ,J,K)   + STEN(I  ,J+1,K) + &
                         STEN(I+1,J-1,K)   + STEN(I+1,J,K)   + STEN(I+1,J+1,K)

             SIDEFRONT = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1) + &
                         STEN(I  ,J-1,K+1) + STEN(I  ,J,K+1) + STEN(I  ,J+1,K+1) + &
                         STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)

             RES(I,J,K) = ( SIDEBACK + SIDEOWN + SIDEFRONT ) / 27.0

        END DO
    END DO
END DO
JohnE
  • 29,156
  • 8
  • 79
  • 109
Manolete
  • 3,431
  • 7
  • 54
  • 92
  • 2
    Just to make sure, this is Fortran, and STEN is a column major array? – szmate1618 Nov 01 '18 at 19:15
  • @szmate1618 Correct – Manolete Nov 01 '18 at 19:57
  • I doubt you can do very much to try to hand optimize the computation as you are striding all over memory. You can compute temporaries in for example the k-loop such as `kp1=k+1` and `km1=k-1` and use those in the inner loops, but a good compiler likely already hoists those expressions. – Steve Nov 01 '18 at 21:41
  • 1
    I think you can do more than that. The current code calculates every partial sum multiple times. I mean one iteration's `STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1) ` will be the next iteration's `STEN(I,J-1,K-1) + STEN(I,J,K-1) + STEN(I,J+1,K-1) `, so no need to fetch and calculate again, you can store those partial results. I'm already working on a full implementation, will post it soon. – szmate1618 Nov 01 '18 at 22:20
  • 1
    Wait, I'm dumb. The Fortran compiler might be able do the same optimization I'm trying to do, if it unrolls the loop. If STEN is passed as an input parameter, it can very easily do that. Not having aliasing, extreme loop unrolling is what Fortran is famous for. – szmate1618 Nov 01 '18 at 23:09
  • As pointed out above, if you don't do any explicit naming of expressions to avoid repeat evaluation of expressions, compiler unrolling could enable a fair amount of it. To make this work, you will need to examine the unrolling options of your compiler and check their effect. For example, gfortran will not attempt unrolling unless specified, e.g. gfortran -funroll-loops --param max-unroll-times=4. ifort offers both compiler options and directives; you might try to persuade it to unroll 2 or more levels of loops by use of unroll and jam directives. – tim18 Nov 02 '18 at 01:20
  • In order for unrolling to recognize repeated expressions, you will need parens around each group of 3 terms. Another small thing: multiply by 1/27. rather than divide, if you aren't using a compiler option which does that automatically (subject to your application tolerating the slight reduction of accuracy). – tim18 Nov 02 '18 at 01:27
  • The added parentheses won't help ifort unless you have set one of the options to require they be treated according to standard, ( -standard-semantics or -assume protect_parens ). – tim18 Nov 02 '18 at 01:45
  • 1
    None of these are pointers, right? – Ross Nov 02 '18 at 17:32
  • @Ross None, they are arrays – Manolete Nov 02 '18 at 18:31

1 Answers1

5

Ok, I think I've tried everything I reasonably could, and my conclusion unfortunately is that there is not too much room for optimizations, unless you are willing to go into parallelization. Let's see why, let's see what you can and can't do.

Compiler optimizations

Compilers nowadays are extremely good at optimizing code, much much more than humans are. Relying on the optimizations done by the compilers also have the added benefit that they don't ruin the readability of your source code. Whatever you do, (when optimizing for speed) always try it with every reasonable combination of compiler flags. You can even go as far as to try multiple compilers. Personally I only used gfortran (included in GCC) (OS is 64-bit Windows), which I trust to have efficient and correct optimization techniques.

-O2 almost always improve the speed drastically, but even -O3 is a safe bet (among others, it includes delicious loop unrolling). For this problem, I also tried -ffast-math and -fexpensive-optimizations, they didn't have any measurable effect, but -march-corei7(cpu architecture-specific tuning, specific to Core i7) had, so I did the measurements with -O3 -march-corei7

So how fast it actually is?

I wrote the following code to test your solution and compiled it with -O3 -march-corei7. Usually it ran under 0.78-0.82 seconds.

program benchmark
implicit none

real :: start, finish
integer :: I, J, K
real :: SIDEBACK, SIDEOWN, SIDEFRONT

integer, parameter :: NX = 600
integer, parameter :: NY = 600
integer, parameter :: NZ = 600
real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: STEN
real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: RES
call random_number(STEN)

call cpu_time(start)
DO K = 1, NZ
    DO J = 1, NY
        DO I = 1, NX

            SIDEBACK =  STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1) + &
                        STEN(I  ,J-1,K-1) + STEN(I  ,J,K-1) + STEN(I  ,J+1,K-1) + &
                        STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)

            SIDEOWN =   STEN(I-1,J-1,K)   + STEN(I-1,J,K)   + STEN(I-1,J+1,K) + &
                        STEN(I  ,J-1,K)   + STEN(I  ,J,K)   + STEN(I  ,J+1,K) + &
                        STEN(I+1,J-1,K)   + STEN(I+1,J,K)   + STEN(I+1,J+1,K)

            SIDEFRONT = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1) + &
                        STEN(I  ,J-1,K+1) + STEN(I  ,J,K+1) + STEN(I  ,J+1,K+1) + &
                        STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)

            RES(I,J,K) = ( SIDEBACK + SIDEOWN + SIDEFRONT ) / 27.0

        END DO
    END DO
END DO
call cpu_time(finish)
!Use the calculated value, so the compiler doesn't optimize away everything.
!Print the original value as well, because one can never be too paranoid.
print *, STEN(1,1,1), RES(1,1,1)
print '(f6.3," seconds.")',finish-start

end program

Ok, so this is as far as the compiler can take us. What's next?

Store intermediate results?

As you might suspect from the question mark, this one didn't really work. Sorry. But let's not rush that forward. As mentioned in the comments, your current code calculates every partial sum multiple times, meaning one iteration's STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1) will be the next iteration's STEN(I,J-1,K-1) + STEN(I,J,K-1) + STEN(I,J+1,K-1), so no need to fetch and calculate again, you can store those partial results. The problem is, that we cannot store too many partial results. As you said, your code is already quite cache-friendly, every partial sum you store means one less array element you can store in L1 cache. We could store a few values, from the last few iterations of I (values for index I-2, I-3, etc.), but the compiler almost certainly does that already. I have 2 proofs for this suspicion. First, my manual loop unrolling made the program slower, by about 5%

DO K = 1, NZ
    DO J = 1, NY
        DO I = 1, NX, 8

            SIDEBACK(0) = STEN(I-1,J-1,K-1) + STEN(I-1,J,K-1) + STEN(I-1,J+1,K-1)
            SIDEBACK(1) = STEN(I  ,J-1,K-1) + STEN(I  ,J,K-1) + STEN(I  ,J+1,K-1)
            SIDEBACK(2) = STEN(I+1,J-1,K-1) + STEN(I+1,J,K-1) + STEN(I+1,J+1,K-1)
            SIDEBACK(3) = STEN(I+2,J-1,K-1) + STEN(I+2,J,K-1) + STEN(I+2,J+1,K-1)
            SIDEBACK(4) = STEN(I+3,J-1,K-1) + STEN(I+3,J,K-1) + STEN(I+3,J+1,K-1)
            SIDEBACK(5) = STEN(I+4,J-1,K-1) + STEN(I+4,J,K-1) + STEN(I+4,J+1,K-1)
            SIDEBACK(6) = STEN(I+5,J-1,K-1) + STEN(I+5,J,K-1) + STEN(I+5,J+1,K-1)
            SIDEBACK(7) = STEN(I+6,J-1,K-1) + STEN(I+6,J,K-1) + STEN(I+6,J+1,K-1)
            SIDEBACK(8) = STEN(I+7,J-1,K-1) + STEN(I+7,J,K-1) + STEN(I+7,J+1,K-1)
            SIDEBACK(9) = STEN(I+8,J-1,K-1) + STEN(I+8,J,K-1) + STEN(I+8,J+1,K-1)

            SIDEOWN(0) = STEN(I-1,J-1,K) + STEN(I-1,J,K) + STEN(I-1,J+1,K)
            SIDEOWN(1) = STEN(I  ,J-1,K) + STEN(I  ,J,K) + STEN(I  ,J+1,K)
            SIDEOWN(2) = STEN(I+1,J-1,K) + STEN(I+1,J,K) + STEN(I+1,J+1,K)
            SIDEOWN(3) = STEN(I+2,J-1,K) + STEN(I+2,J,K) + STEN(I+2,J+1,K)
            SIDEOWN(4) = STEN(I+3,J-1,K) + STEN(I+3,J,K) + STEN(I+3,J+1,K)
            SIDEOWN(5) = STEN(I+4,J-1,K) + STEN(I+4,J,K) + STEN(I+4,J+1,K)
            SIDEOWN(6) = STEN(I+5,J-1,K) + STEN(I+5,J,K) + STEN(I+5,J+1,K)
            SIDEOWN(7) = STEN(I+6,J-1,K) + STEN(I+6,J,K) + STEN(I+6,J+1,K)
            SIDEOWN(8) = STEN(I+7,J-1,K) + STEN(I+7,J,K) + STEN(I+7,J+1,K)
            SIDEOWN(9) = STEN(I+8,J-1,K) + STEN(I+8,J,K) + STEN(I+8,J+1,K)

            SIDEFRONT(0) = STEN(I-1,J-1,K+1) + STEN(I-1,J,K+1) + STEN(I-1,J+1,K+1)
            SIDEFRONT(1) = STEN(I  ,J-1,K+1) + STEN(I  ,J,K+1) + STEN(I  ,J+1,K+1)
            SIDEFRONT(2) = STEN(I+1,J-1,K+1) + STEN(I+1,J,K+1) + STEN(I+1,J+1,K+1)
            SIDEFRONT(3) = STEN(I+2,J-1,K+1) + STEN(I+2,J,K+1) + STEN(I+2,J+1,K+1)
            SIDEFRONT(4) = STEN(I+3,J-1,K+1) + STEN(I+3,J,K+1) + STEN(I+3,J+1,K+1)
            SIDEFRONT(5) = STEN(I+4,J-1,K+1) + STEN(I+4,J,K+1) + STEN(I+4,J+1,K+1)
            SIDEFRONT(6) = STEN(I+5,J-1,K+1) + STEN(I+5,J,K+1) + STEN(I+5,J+1,K+1)
            SIDEFRONT(7) = STEN(I+6,J-1,K+1) + STEN(I+6,J,K+1) + STEN(I+6,J+1,K+1)
            SIDEFRONT(8) = STEN(I+7,J-1,K+1) + STEN(I+7,J,K+1) + STEN(I+7,J+1,K+1)
            SIDEFRONT(9) = STEN(I+8,J-1,K+1) + STEN(I+8,J,K+1) + STEN(I+8,J+1,K+1)

            RES(I    ,J,K) = (  SIDEBACK(0) + SIDEOWN(0) + SIDEFRONT(0) + &
                                SIDEBACK(1) + SIDEOWN(1) + SIDEFRONT(1) + &
                                SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) ) / 27.0
            RES(I + 1,J,K) = (  SIDEBACK(1) + SIDEOWN(1) + SIDEFRONT(1) + &
                                SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) + &
                                SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) ) / 27.0
            RES(I + 2,J,K) = (  SIDEBACK(2) + SIDEOWN(2) + SIDEFRONT(2) + &
                                SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) + &
                                SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) ) / 27.0
            RES(I + 3,J,K) = (  SIDEBACK(3) + SIDEOWN(3) + SIDEFRONT(3) + &
                                SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) + &
                                SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) ) / 27.0
            RES(I + 4,J,K) = (  SIDEBACK(4) + SIDEOWN(4) + SIDEFRONT(4) + &
                                SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) + &
                                SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6)   ) / 27.0
            RES(I + 5,J,K) = (  SIDEBACK(5) + SIDEOWN(5) + SIDEFRONT(5) + &
                                SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) + &
                                SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) ) / 27.0
            RES(I + 6,J,K) = (  SIDEBACK(6) + SIDEOWN(6) + SIDEFRONT(6) + &
                                SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) + &
                                SIDEBACK(8) + SIDEOWN(8) + SIDEFRONT(8) ) / 27.0
            RES(I + 7,J,K) = ( SIDEBACK(7) + SIDEOWN(7) + SIDEFRONT(7) + &
                                SIDEBACK(8) + SIDEOWN(8) + SIDEFRONT(8) + &
                                SIDEBACK(9) + SIDEOWN(9) + SIDEFRONT(9) ) / 27.0

        END DO
    END DO
END DO

And what's worse, it's easy to show we are already pretty close the theoretical minimal possible execution time. In order to calculate all these averages, the absolute minimum we need to do, is access every element at least once, and divide them by 27.0. So you can never get faster than the following code, which executes under 0.48-0.5 seconds on my machine.

program benchmark
implicit none

real :: start, finish
integer :: I, J, K

integer, parameter :: NX = 600
integer, parameter :: NY = 600
integer, parameter :: NZ = 600
real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: STEN
real, dimension (0 : NX + 2, 0 : NY + 2, 0 : NZ + 2) :: RES
call random_number(STEN)

call cpu_time(start)
DO K = 1, NZ
    DO J = 1, NY
        DO I = 1, NX

            !This of course does not do what you want to do,
            !this is just an example of a speed limit we can never surpass.
            RES(I, J, K) = STEN(I, J, K) / 27.0

        END DO
    END DO
END DO
call cpu_time(finish)
!Use the calculated value, so the compiler doesn't optimize away everything.
print *, STEN(1,1,1), RES(1,1,1)
print '(f6.3," seconds.")',finish-start

end program

But hey, even a negative result is a result. If just accessing every element once (and dividing by 27.0) takes up more than half of the execution time, that just means memory access is the bottle neck. Then maybe you can optimize that.

Less data

If you don't need the full precision of 64-bit doubles, you can declare your array with a type of real(kind=4). But maybe your reals are already 4 bytes. In that case, I believe some Fortran implementations support non-standard 16-bit doubles, or depending on your data you can just use integers (maybe floats multiplied by a number then rounded to integer). The smaller your base type is, the more elements you can fit into the cache. The most ideal would be integer(kind=1), of course, it caused more than a 2x speed up on my machine, compared to real(kind=4). But it depends on the precision you need.

Better locality

Column major arrays are slow when you need data from neighbouring column, and row major ones are slow for neighbouring rows. Fortunately there is a funky way to store data, called a Z-order curve, which does have applications similar to your use case in computer graphics. I can't promise it will help, maybe it will be terribly counterproductive, but maybe not. Sorry, I didn't feel like implementing it myself, to be honest.

Parallelization

Speaking of computer graphics, this problem is trivially and extremely well parallelizable, maybe even on a GPU, but if you don't want to go that far, you can just use a normal multicore CPU. The Fortran Wiki seems like a good place to search for Fortran parallelization libraries.

szmate1618
  • 1,545
  • 1
  • 17
  • 21
  • 3
    `real(kind=4)` does not have to be 4 bytes, it may not exist at all and indeed with some compilers it doesn't. https://stackoverflow.com/questions/838310/fortran-90-kind-parameter Similarly, `real(kind=1)` does exist in some compilers but it may be a usual default real. – Vladimir F Героям слава Nov 02 '18 at 12:54
  • Unrolling may not be effective, as you have not used parens or other method to assure that the compiler recognizes common subexpressions between the adjacent assignments. Also, you should try small amounts of unroll at multiple levels. The unrolling in the inner loop is done automatically under the control of -funroll-loops with --param options, so there is little point in writing it in. – tim18 Nov 02 '18 at 13:36
  • It's possible that automatic unrolling is ineffective. It's also possible that manual unrolling is unnecessary. But it's not possible that automatic unrolling is ineffective, AND manual unrolling is unnecessary, I think that's a contradiction. One must be faster than the other. I think my manual unrolling and manual common subexpression optimization being slower than the automatic one shows that the compiler already does these optimizations, and it does them better than me. – szmate1618 Nov 02 '18 at 14:16
  • The compiler manages to vectorize the inner loop, which is what you would like. I tried to block the outer loops but that slowed down the runtime. – Manolete Nov 02 '18 at 15:18
  • I woulld also agree that memory access is what it makes this loop very slow, but I have no better ideas to speed the access to memory unless you use the L1 as much as possible. – Manolete Nov 02 '18 at 15:20
  • 1
    This example probably isn't large enough for loop blocking to be needed; it probably can go all the way through the inner loop into the next middle loop without losing any prefetched data. Hardware event profiling tools are meant to help speed experiments of this kind. – tim18 Nov 02 '18 at 17:30