what are the considerations for pointers/copies for OMP parallelized fortran program

Question

I am working on a large program that I have parallelized with OMP, and in several places for the parallelization I have to make a choice between using pointers or copies of arrays of data. I had an intuitive sense that the choice might have an impact on compute time, but I am uncertain. In the example program given below, I am unable to discern a difference.

compile with:

gfortran paralleltest.f90 -fopenmp

run with:

>>> ./a.out 8 10000000
 Copies (s):     0.835768998    
 results:    8415334.9722864032        8415334.9722864032        8415334.9722864032        8415334.9722864032        8415334.9722864032        8415334.9722864032        8415334.9722864032        8415334.9722864032     
 Pointers (s):   0.837329030    
 results:    8415334.9722864032        8415334.9722864032        8415334.9722864032        8415334.9722864032        8415334.9722864032        8415334.9722864032        8415334.9722864032        8415334.9722864032

Given this example I am wondering about some questions:

in general would one expect to see no difference between pointers and copies used in this say?
what circumstances, if any, would lead to significant differences between the two, assuming that the memory overhead of the copies was not an issue.
am I measuring this performance difference in the right way? and if not, what is a better way to measure the performance?

Here is the fortran file:

! paralleltest.f90

module m_parallel_test

  implicit none

  type :: t_data
     real, allocatable :: x(:)
  end type t_data

  type :: t_data_ptr
     real, pointer :: x(:)
  end type t_data_ptr

contains

  real(kind=8) function aggregate_data(x) result(value)
    real :: x(:)
    integer :: i
    do i=1,size(x)
       value = value + cos(x(i))
    end do
  end function aggregate_data

  subroutine associate_pointer(parent, ptr)
    real, target :: parent(:)
    real, pointer :: ptr(:)
    ptr => parent
  end subroutine associate_pointer

end module m_parallel_test


program main

  use m_parallel_test

  implicit none

  character(len=32) :: arg
  integer, parameter :: n_set = 8
  integer :: n_threads, n_data, i

  type(t_data) :: parent
  type(t_data), allocatable :: copies(:)
  type(t_data_ptr), allocatable :: pointers(:)
  real(kind=8), allocatable :: copies_results(:)
  real(kind=8), allocatable :: pointers_results(:)
  real :: start_time, stop_time

  call get_command_argument(1, arg)
  read(arg, *) n_threads
  call omp_set_num_threads(n_threads)
  allocate(copies(n_set))
  allocate(pointers(n_set))
  allocate(copies_results(n_set))
  allocate(pointers_results(n_set))

  call get_command_argument(2, arg)
  read(arg, *) n_data
  allocate(parent%x(n_data))

  call seed_rng(1)
  call random_number(parent%x)

  do i=1,n_set
     copies(i)%x = parent%x
     call associate_pointer(parent%x, pointers(i)%x)
  end do

  call cpu_time(start_time)
  !$OMP PARALLEL DO
  do i=1,n_set
     copies_results(i) = aggregate_data(copies(i)%x)
  end do
  call cpu_time(stop_time)
  print *, "Copies (s):   ", stop_time - start_time
  print *, "results: ", copies_results

  call cpu_time(start_time)
  !$OMP PARALLEL DO
  do i=1,n_set
     pointers_results(i) = aggregate_data(pointers(i)%x)
  end do
  call cpu_time(stop_time)
  print *, "Pointers (s): ", stop_time - start_time
  print *, "results: ", pointers_results

end program main


subroutine seed_rng(i)
  integer, intent(in) :: i
  integer, allocatable :: iseed(:)
  integer :: n, j
  call random_seed(size=n)
  allocate(iseed(n))
  iseed= [(i + j, j=1, n)]
  call random_seed(put=iseed)
end subroutine seed_rng

Can't help with your actual quest, but I will point out the gfortran comand line you show above will do very little in the way of optimizing the code. At a minimum, you likely want to add `-O` to your command line. — evets, Sep 16 '20 at 18:32
@HighPerformanceMark, thanks for the comment. My question is probably derived from inexperience, but I am trying to understand if having multiple threads access shared memory has any noticeable performance difference from having multiple threads accessing copies of memory. In most cases it will just be reading the memory — Vince W., Sep 16 '20 at 23:21

score 2 · Accepted Answer · answered Sep 16 '20 at 21:01

The code is not conforming for two reasons:

the function result value in aggregate_data is not defined before being referenced.
in the main program, the pointer component x of the elements of pointers have undefined pointer association status when passed to the aggregate_data procedure, on account of the first actual argument in the calls to associate_pointer not having the TARGET attribute.

(You can call a procedure with a dummy argument that has the TARGET attribute with a corresponding actual argument that does not, but when you do, pointers associated with the dummy argument (such as ptr inside associate_pointer) become undefined when execution of the procedure completes.)

The errors in the program can be corrected by giving the parent variable in the main program the TARGET attribute, and by sticking an appropriate assignment statement prior to the loop in aggregate_data.

Beyond correctness, the only difference between the two loops is in the actual argument passed to aggregate_data - one loop passes an array designated by an allocatable component, one loop passes a array designated by a pointer component. The procedure takes that array argument as an assumed shape array dummy argument without CONTIGUOUS - given typical implementation of assumed shape arrays there's no need for the compiler to do anything different passing pointer vs allocatable.

(If the dummy argument of aggregate_data was explicitly or implicitly contiguous, there may be a need for the compiler to copy the pointer array prior to the call as pointers can be associated with non-contiguous arrays. Allocatable arrays are always contiguous.)

Both loops write their result to an non-target allocatable array of different kind to the actual argument components - so there cannot be any aliasing across the assignment statement.

Neither actual argument is being modified within the loops, so potential aliasing issues associated with the pointer arguments designating the same array in different iterations are moot (besides which, from the compiler's perspective - that's a guarantee that is on the head of the programmer - it doesn't have to care).

All up, I'd expect the same machine instructions to be issued for the two loops. Similar timing is unsurprising.

Results may differ if different work is done in the loops - if aliasing or contiguity comes into play.

Where the approaches will differ is in the data copying required to set the allocatable array components arguments up. But that is (intentionally?) outside that timing regime.

As of Fortran 2003 on (which this code requires), do not use POINTER's unless you need reference semantics.

Thank you for the answer, I am parsing it. What errors are you referring to? When I compiled, I saw no errors or warnings (gcc 10.2, std=gnu) and when I ran I saw no errors. — Vince W., Sep 16 '20 at 21:32
The errors are described by the first two dot points in the answer. Compilers will not, and can not, identify all errors in your code. — IanH, Sep 16 '20 at 21:49
Your explanation of this particular situation makes sense but I wonder if you could expand on the consequences of the pointer version having the threads all accessing the same array in memory versus the copies version all accessing different arrays in memory. I am largely ignorant of the consequences of this from a performance perspective. Maybe there are no consequences... I don’t know — Vince W., Sep 16 '20 at 22:08
Given what you have provided you have the only answer that can be given "Maybe". It will depend upon the details of what you are doing, and the only way to know for sure is to compare both methods. Also please don't user Real(8) is noted here ad infinitum, and I'll note pointers in Fortran are generally best avoided. — Ian Bush, Sep 17 '20 at 07:21
@IanBush, I wonder, for the sake of my own learning, if you would explain why you would avoid pointers in fortran, or provide a link to something that helps me understand what you're getting at. Also, are you suggesting not to use double precision, or something else? I didn't understand "don't user Real(8)". — Vince W., Sep 17 '20 at 12:33
Real( 8 ) is not portable. See https://stackoverflow.com/questions/838310/fortran-90-kind-parameter as an example, but it is a recurrent theme - and we have seen questions where it caused a problem. — Ian Bush, Sep 17 '20 at 13:04
Most uses of pointers can also be achieved using allocatable arrays or array sections. Pointers have a number of unique possible bugs associated with their use. By avoiding their use you avoid these bugs - and the features mentioned often (but not quite always) obviate their need. — Ian Bush, Sep 17 '20 at 13:06

what are the considerations for pointers/copies for OMP parallelized fortran program

1 Answers1