FFTW computation too slow with OpenMP

Question

I compile the following program with gfortran -O -mcmodel=medium test.f90 -lfftw3_omp -lfftw3 -lm -qopenmp and I have added fftw3.f to /usr/lib. And I use ulimit -s unlimited to run ./out because I get a stack overflow otherwise (Why Segmentation fault is happening in this openmp code?).

program main 


Implicit none
include  'omp_lib.h'
include 'fftw3.f'
Integer i, j, k, iter
Integer,Parameter :: Nx    =381
Integer,Parameter :: Ny    =129 
Integer,Parameter :: Nz    =129
Integer, parameter :: N1 =  Ny-1
Integer, parameter :: N2 =  Nz-1
double precision in
dimension in(N1,N2)
double complex out
dimension out(N1/2 + 1, N2)
integer*8 plan
real*8    U(-2:Nx+2,-2:Ny+2,-2:Nz+2)
real*8    V(-2:Nx+2,-2:Ny+2,-2:Nz+2)
real*8    W(-2:Nx+2,-2:Ny+2,-2:Nz+2)
Real*8  ky_vis(0:Ny)
Real*8  kz_vis(0:Nz)
Complex*16  F_F1(-2:Nx+2,-2:Ny+2,-2:Nz+2)
Real*8,Parameter :: pi  = 3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825d0
Real*8,Parameter :: Ly  = 1.0d0 * 2.0d0 * pi
Real*8,Parameter :: Lz  = 1.0d0 * 2.0d0 * pi
Real*8 c1_Nfft
Integer,Parameter :: NyFFT = (Ny-1 + 2) / 2



c1_Nfft = 1.0d0 / dsqrt( Dble(Ny-1) * Dble(Nz-1) )

U= reshape((/(i, i=1,(Nx+5)*(Ny+5)*(Nz+5))/),shape(U))
V =  reshape((/(i, i=1,(Nx+5)*(Ny+5)*(Nz+5))/),shape(V))
W =  reshape((/(i, i=1,(Nx+5)*(Ny+5)*(Nz+5))/),shape(W))

   !$OMP PARALLEL DO PRIVATE(j) SHARED(ky_vis)
    Do j = 1, (Ny-1)/2+1   
      ky_vis(j) = ( 2.0d0 * pi /Ly ) * Dble(        j-1 )
    End Do
    !$OMP End PARALLEL DO

    !$OMP PARALLEL DO PRIVATE(j) SHARED(ky_vis)
    Do j=(Ny-1)/2+2,Ny-1
      ky_vis(j) = ( 2.0d0 * pi /Ly ) * Dble((Ny-1)-(j-1))
    end do
    !$OMP End PARALLEL DO

     !$OMP PARALLEL DO PRIVATE(k) SHARED(kz_vis)
    Do k = 1, (Nz-1)/2+1   
      kz_vis(k) = ( 2.0d0 * pi /Lz ) * Dble(        k-1 )
    End Do
    !$OMP End PARALLEL DO

    !$OMP PARALLEL DO PRIVATE(j) SHARED(ky_vis)
    Do k=(Nz-1)/2+2,Nz-1
      kz_vis(k) = ( 2.0d0 * pi  /Lz ) * Dble((Nz-1)-(k-1))
    end do
    !$OMP End PARALLEL DO

!-----------FFTW----------------        
    Do i = 1, Nx-1 ! --- \8F\87\95ϊ\B7 ---
!          !$OMP PARALLEL DO PRIVATE(i,j,k) SHARED(Real_F1)
      Do k = 1, Nz-1
      Do j = 1, Ny-1
        in(j,k) = U(i,j,k)
      End Do; End Do
!         !$OMP END PARALLEL DO



     call dfftw_plan_dft_r2c_2d(plan,N1,N2,in,out,FFTW_ESTIMATE)
     call dfftw_execute_dft_r2c(plan, in, out)
     call dfftw_destroy_plan(plan)

!          !$OMP PARALLEL DO PRIVATE(j,k) SHARED(F_F1)
      Do j = 1, NyFFT
      Do k = 1, Nz-1
        F_F1(i,j,k) = out(j,k) * c1_Nfft
      End Do; End Do
!          !$OMP END PARALLEL DO
    End Do

!-----------IFFTW----------------  

    Do i = 1, Nx-1 !

      Do j = 1, NyFFT
      Do k = 1, Nz-1
        out(j,k) = - (  ky_vis(j) * ky_vis(j)                    &
                             + kz_vis(k) * kz_vis(k) ) * F_F1(i,j,k)
      End Do; End Do



      call dfftw_plan_dft_c2r_2d(plan,N1,N2,out,in,FFTW_ESTIMATE)
      call dfftw_execute_dft_c2r(plan, out, in)
      call dfftw_destroy_plan(plan)

!          !$OMP PARALLEL DO PRIVATE(j,k) SHARED(ddUdx)
      Do k = 1, Nz-1
      Do j = 1, Ny-1
        U(i,j,k) = U(i,j,k) + in(j,k) * c1_Nfft
      End Do; End Do
!          !$OMP END PARALLEL DO
    End Do

 end program

I used system_clock to check the FFTW execution time and found that it is actually too slow to compute. Is there a better way to make it run faster? I use gfortran on Ubuntu 17.04, CPU is AMD1950x, memory 64G.

Welcome, please take the [tour] and read [ask]. How it fails? Any error messages? Wrong results? How wrong? How do you compile the code? With which compiler? Please [edit] the question with more information. — Vladimir F Героям слава, Mar 18 '18 at 08:38
Also, we need to know the rest of the code and know the values of Nx and Ny, see [mcve]. Try to make Nx and Ny very small and report back if it still crashes. Try to make your arrays allocatable and report back if it still crashes. — Vladimir F Героям слава, Mar 18 '18 at 08:41
Although Hristo always recommends playing with the stack in his sometimes long and very detailed answers, I recommend `allocatable` arrays. — Vladimir F Героям слава, Mar 18 '18 at 08:46
sorry, I just want to know if there is any way to use multi-threaded fftw，because one of my threads is really slow to calculate fftw. I tried several wrong methods, so I did not submit it. NX=1000, NY=300, NZ=300. — Tom, Mar 18 '18 at 09:17
Please [edit] the question and show all the things I requested. Foremost, you **must** tell us **how does it fail**. What happens exactly? Please do see the [tour] and do read [ask]. Does it really crash or is it just slow with the code you show? Be *clear* and *detailed*. Show us a **complete** piece of code that we can compile, not just one subroutine. There are many examples of FFTW and OpenMP on this site but we need more information to solve your problem. — Vladimir F Героям слава, Mar 18 '18 at 09:56
Please, please, we need the detauls. We need more code ([mcve]) we need to be able to compile that ourselves and test it, it needs to be **complete**. NOT just one subroutine. And we need your test results. At least now we need it does run so it is not in fact a duplicate, but we really DO need much more details from you. Do NOT just add more sentence to your post, edit it thoroughly and add those many things I requested. I will then reopen the question, but it is not answerable now yet. Yes, OpenMP can be made fast, but we must see much more to tell you what is wrong now. — Vladimir F Героям слава, Mar 19 '18 at 08:33
You must also tell us **what** you are measuring and **how** you are measuring it. It is very, very important!!! You should NOT normally crate a plane for each Fourier transform. Creating the plan takes MUCH more time than executing it!!! It is really, really important to show the measuring results and tell exactly how you got them. Show us the **complete** code so that we can tell you how you can reuse the plan. — Vladimir F Героям слава, Mar 19 '18 at 08:38
@VladimirF sorry,this is a sample .I can compile it when I use ulimit -s unlimited,but it is too slow！ — Tom, Mar 20 '18 at 12:10
Thank you, I have re-opened the question. Still, you should tell us where did you put the `system_clock()` and the values you measured. Also, any code that should be fast **must** be compiled with optimizations like `-O2` or -O3`, although here it is less important because the library is already optimized. — Vladimir F Героям слава, Mar 20 '18 at 12:18
@VladimirF To be precise, this is a subroutine。I call this subroutine between system_clock(start) and system_clock(finish).I tried to compare the time spent running this subroutine with other subroutines and found that it took up most of my main program, so I wanted to improve it — Tom, Mar 20 '18 at 12:58
It says `program main`, not `subroutine`!!! Show us the **real** code. Always do that! It is not useful to show something else than what you are testing. Have you read [ask] and [mcve] yet? — Vladimir F Героям слава, Mar 20 '18 at 13:24
Thank you for your answer！I have solved it by putting the creation plan and destruction plan out of the loop！because my program has 4000 rows , I can not submit it ! Thank you ! — Tom, Mar 22 '18 at 07:04
@VladimirF Thank you for your answer！I have solved it by putting the creation plan and destruction plan out of the loop！because my program has 4000 rows , I can not submit it ! Thank you ! — Tom, Mar 22 '18 at 07:09

Vladimir F Героям слава · Answer 1 · 2018-03-20T12:36:25.093

Firstly, as I already indicated in my comments before, you should never repeatedly do

 do ... !your loop
   call dfftw_plan_dft_r2c_2d(plan,N1,N2,in,out,FFTW_ESTIMATE)
   call dfftw_execute_dft_r2c(plan, in, out)
   call dfftw_destroy_plan(plan)
end do

many times in a loop provided N1 and N2 are constant. It is not such a big disaster with FFTW_ESTIMATE but it is still slow. You should make the plan once and re-use it. Creating a FFTW plan is slow.

 call dfftw_plan_dft_r2c_2d(plan,N1,N2,in,out,FFTW_ESTIMATE)

 do .... !your loop
   call dfftw_execute_dft_r2c(plan, in, out)
 end do

 call dfftw_destroy_plan(plan)

Secondly, you must first instruct FFTW to use OpenMP threads. It is all in the manual, I will just show how I do it in my code:

  use iso_c_binding
  integer(c_int) :: nthreads
  integer(c_int) :: error

  !$ nthreads = omp_get_num_threads()

  error =  fftw_init_threads()

  if (error==0) then
    write(*,*) "Error when initializing FFTW for threads."
  else
    call fftw_plan_with_nthreads(nthreads)
  end if

This uses the modern FFTW Fortran interface, not your legacy Fortran 90 interface (yes Fortran 90 is old and obsolete). But it should be reasonably similar in the legacy interface, something like

    integer :: nthreads, error

    !$ nthreads = omp_get_num_threads()

    call dfftw_init_threads(error)

    call dfftw_plan_with_nthreads(nthreads)

FFTW computation too slow with OpenMP

1 Answers1