Here is a simple Fortran code using OpenMP that computes a summation of arrays multiple times. My computer has 6 cores (12 threads) and 16 GB of memory.
There are two versions of this code. The first version has a single file, test.f90, and the summation is implemented in that file. The code is as follows:
program main
    implicit none
    integer*8 :: begin, end, rate
    integer i, j, k, ii, jj, kk, cnt
    real*8,allocatable,dimension(:,:,:)::theta, e

    allocate(theta(2000,50,5))
    allocate(e(2000,50,5))

    call system_clock(count_rate=rate)
    call system_clock(count=begin)
    !$omp parallel do
    do cnt = 1, 8
        do i = 1, 1001
            do j = 1, 50
                theta = theta+0.5d0*e
            end do
        end do
    end do
    !$omp end parallel do
    call system_clock(count=end)
    write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

    deallocate(theta)
    deallocate(e)
end program main
This version has no problem with OpenMP, and we can see a speedup.
The second version is modified so that the summation is implemented in a subroutine. There are two files, test.f90 and sub.f90, which are as follows:
! test.f90
program main
    use sub
    implicit none
    integer*8 :: begin, end, rate
    integer i, j, k, ii, jj, kk, cnt

    call system_clock(count_rate=rate)
    call system_clock(count=begin)
    !$omp parallel do
    do cnt = 1, 8
        call summation()
    end do
    !$omp end parallel do
    call system_clock(count=end)
    write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate
end program main
and
! sub.f90
module sub
    implicit none
contains
    subroutine summation()
        implicit none
        real*8,allocatable,dimension(:,:,:)::theta, e
        integer i, j

        allocate(theta(2000,50,5))
        allocate(e(2000,50,5))
        theta = 0.d0
        e = 0.d0
        do i = 1, 101
            do j = 1, 50
                theta = theta+0.5d0*e
            end do
        end do
        deallocate(theta)
        deallocate(e)
    end subroutine summation
end module sub
I also wrote a Makefile, as follows:
FC = ifort -O2 -mcmodel=large -qopenmp
LN = ifort -O2 -mcmodel=large -qopenmp
FFLAGS = -c
LFLAGS =

result: sub.o test.o
	$(LN) $(LFLAGS) -o result test.o sub.o

test.o: test.f90
	$(FC) $(FFLAGS) -o test.o test.f90

sub.o: sub.f90
	$(FC) $(FFLAGS) -o sub.o sub.f90

clean:
	rm result *.o* *.mod *.e*
(gfortran can be used instead.) However, when I run this version, using OpenMP causes a dramatic slowdown, and it is even much slower than the single-threaded version (no OpenMP). So what is happening here, and how can I fix it?
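To make the comparison reproducible, the serial and parallel timings can be taken in a single run. The following is only a minimal sketch, not one of the two versions above: the program name `timing_sketch` is hypothetical, it assumes the `sub` module from sub.f90 is unchanged, and it uses `omp_get_wtime` and `omp_get_max_threads` from the standard `omp_lib` module in place of `system_clock`.

```fortran
! timing_sketch.f90 (hypothetical file name, used in place of test.f90)
program timing_sketch
    use omp_lib   ! standard OpenMP module: omp_get_wtime, omp_get_max_threads
    use sub       ! assumed to be the unchanged module from sub.f90
    implicit none
    integer :: cnt
    double precision :: t0, t1

    ! Serial baseline: the same 8 calls, with no parallel region
    t0 = omp_get_wtime()
    do cnt = 1, 8
        call summation()
    end do
    t1 = omp_get_wtime()
    write(*, *) 'serial time   : ', t1 - t0

    ! Parallel run: the same 8 calls distributed over the OpenMP threads
    t0 = omp_get_wtime()
    !$omp parallel do
    do cnt = 1, 8
        call summation()
    end do
    !$omp end parallel do
    t1 = omp_get_wtime()
    write(*, *) 'parallel time : ', t1 - t0
    write(*, *) 'max threads   : ', omp_get_max_threads()
end program timing_sketch
```

It needs the same OpenMP flag as in the Makefile (`-qopenmp` for ifort, `-fopenmp` for gfortran). Setting the environment variable `OMP_NUM_THREADS` to different values before running shows how the parallel time changes with the thread count.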