How to optimize a function and minimize allocations

Question

The following function generates primes up to N. For large N, this becomes quite slow, my Julia implementation is 5X faster for N = 10**7. I guess the creation of a large integer array and using pack to collect the result is the slowest part. I tried counting .true.s first, then allocating res(:) and populating it using a loop, but the speedup was negligible (4%) as I iterate the prims array twice in this case. In Julia, I used findall which does exactly what I did; iterating the array twice, first counting trues and allocationg result then populating it. Any ideas? Thank you.

Compiler: Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.3.210 Build 20180410 (on Windows 10)

Options: ifort -warn /O3 -heap-arrays:8000000

program main
    implicit none
    integer, allocatable :: primes(:) 
    integer :: t0, t1, count_rate, count_max

    call system_clock(t0, count_rate, count_max)
    primes = do_primes(10**7)
    call system_clock(t1)

    print '(a,f7.5,a)', 'Elapsed time: ', real(t1-t0)/count_rate, ' seconds'
    print *, primes(1:10)

contains 

    function do_primes(N) result (res)
        integer, allocatable :: res(:), array(:) 
        logical, allocatable :: prims(:)
        integer :: N, i, j

        allocate (prims(N))
        prims = .true.
        i = 3
        do while (i * i < N) 
            j = i 
            do while (j * i < N)
                prims(j*i) = .false.
                j = j + 2
            end do
            i = i + 2
        end do 
        prims(1) = .false.
        prims(2) = .true.
        do i = 4, N, 2  
            prims(i) = .false.
        end do

        allocate (array(N)) 
        do i = 1, N 
            array(i) = i 
        end do 
        res = pack(array, prims)
    end 
end

Timing (147 runs):

Elapsed time: 0.14723 seconds

Edit:

I converted the do whiles to straight dos as per @IanBush comment like this, still no speedup:

do i = 3, sqrt(dble(N)), 2
    do j = i, N/i, 2
        prims(j*i) = .false.
    end do          
end do

The Julia implementation:

function do_primes(N)
    prims = trues(N)
    i = 3
    while i * i < N
        j = i 
        while j * i < N
            prims[j*i] = false
            j = j + 2
        end
        i = i + 2
    end 
    prims[1] = false
    prims[2] = true
    prims[4:2:N] .= false  
    return findall(prims) 
end

Timing:

using Benchmarktools 
@benchmark do_primes(10^7)
BenchmarkTools.Trial:
  memory estimate:  6.26 MiB
  allocs estimate:  5
  --------------
  minimum time:     32.227 ms (0.00% GC)
  median time:      32.793 ms (0.00% GC)
  mean time:        34.098 ms (3.92% GC)
  maximum time:     94.479 ms (65.46% GC)
  --------------
  samples:          147
  evals/sample:     1

When you say "I guess the creation of a large integer array and using pack to collect the result is the slowest part" have you actually profiled the code? I would have thought nested loops would be the section that takes the most time. And why do while rather than just straight do loops - it is my understanding that the later provide more opportunities for optimisation. — Ian Bush, Sep 03 '20 at 09:44
It's just that I see the nested loops are essential in the algorithm and cannot be avoided, so the obvious thing to me was avoiding the unnecessary creation of a large array just to pack the result. The `do while` is convenient for the inner loop as the termination condition is `j*i — AboAmmar, Sep 03 '20 at 10:02
If the nested loops take all the time there is no speed up that can be gained by optimising the pack. Please run your code with a profiler and tell us which parts of it take the time. Then, and only then, do you know where to concentrate your efforts — Ian Bush, Sep 03 '20 at 11:03
I would start with proper implementation of Eratosthenes sieve algorithm, you are doing way too much unnecessary work https://stackoverflow.com/questions/7921456/sieve-of-eratosthenes — Antonín Lejsek, Sep 03 '20 at 11:34
@ AntonínLejsek - It's not about algorithms, I know there are algorithms which scale much better for larges N. The question is about why for the exact same lines of code, Julia is 5X faster than Fortran? I expect Fortran to be at least as fast, if not faster. That is why I feel something is wrong! — AboAmmar, Sep 03 '20 at 16:33
@IanH - I added the Julia implementation with timing for reference, please see my update. — AboAmmar, Sep 03 '20 at 22:30
Julia seems to be using [BitArray](https://docs.julialang.org/en/v1/base/arrays/#Base.BitArray) for prims, which might help increase the speed (just guessing...). If I change prims = trues(N) to prims = fill(true,N), then the code becomes ~2 times slower on my mac (which is close to what I could get with a modified fortran code). FWIW, I tried a similar code in Python + [Numba](http://numba.pydata.org/), then the elapsed time was about x3 of Julia (with BitArray) and x1.5 of gfortran-10. Also, Julia and Numba use LLVM, so the difference may be related to LLVM vs GCC (just a guess...) — roygvib, Sep 04 '20 at 07:04
Yes - execution speed is being limited by memory bandwidth, and default logical is huge relative to a single bit. — IanH, Sep 04 '20 at 07:09
@roygvib - Good observation, I see similar timings as you did. Replacing Julia's `trues(N)` by `fill(true,N)` slows down the code 2X (65 ms) and using `logical(1)` in Fortran speeds up the code 2X (68 ms), so close to Julia's Bool. Can we go further with Fortran to match Julia's `BitArray`? Please consider posting an answer, thank you. — AboAmmar, Sep 04 '20 at 08:45
@AboAmmar Hi, sorry to be late for reply... and it is interesting that the timing changes very similarly on your environment also (ifort + windows) compared to mine (gfortran + mac). RE an attempt to match Julia's BitArray, I was not able to reach that speed by myself... (though it might be possible via some bitwise intrinsic functions and significant code modifications). RE an answer, if possible, could you add one by using your same environment + compiler/options (which I think would make the most fair comparison :-) — roygvib, Sep 07 '20 at 04:03

How to optimize a function and minimize allocations

0 Answers0