1

I'm working on Intel Fortran compatibility for a fairly large code (weather prediction model). On Intel Fortran (and only Intel Fortran) some character data seems to get scrambled looping through and point character pointers to it. The characters I end up with only contains 0 and 9. The character string c simply has "00:00:00:16 (whitespace padding....)" as its content.

I was able to reproduce the issue to come up whenever I include any (dead) OpenMP code and use -fopenmp compiler flag. I'm a bit lost at what's going on here. Can you detect any programming errors in the following minimal reproducer? If not, I guess I will file a compiler bug.

Reproducer

minimal.f90

module foo
    implicit none

    type pointer_character
    character, pointer :: c
    end type
    type(pointer_character), allocatable :: ptr_c(:)
    integer(4) :: idx_c

    contains

    subroutine grpbcast_set_c(c)
        implicit none
        character(*), intent(in), target :: c
        integer(4) :: cloc

        allocate(ptr_c( 256 ))
        idx_c = 0

        do cloc = 1, len(c)
            idx_c = idx_c + 1
            ptr_c( idx_c )%c => c( cloc:cloc )
        end do

        print *, "testprint-2", c
        print *, "testprint-3.1", ptr_c( 1 )%c
        print *, "testprint-3.2", ptr_c( 2 )%c
        print *, "testprint-3.3", ptr_c( 3 )%c
        print *, "testprint-3.4", ptr_c( 11 )%c

    end subroutine

    ! The following subroutine is never called, but if we include it in the module foo, it will lead to the data corruption documented in
    ! http://stackoverflow.com/questions/42359258/intel-fortran-substring-access-with-convert-big-endian
    ! If the subroutine is commented it will work
    !
    ! In other words:
    ! if we comment this subroutine out, the output will be
    ! =====================================================
    ! testprint-2
    ! 00:00:00:16



    ! testprint-3.10
    ! testprint-3.20
    ! testprint-3.3:
    ! testprint-3.46
    ! =====================================================
    ! However if we include it, the output will be
    ! testprint-2
    ! 00:00:00:16



    ! testprint-3.10
    ! testprint-3.20
    ! testprint-3.30
    ! testprint-3.40
    ! =====================================================
    !
    ! tested with: ifort 17.0.1 20161005
    subroutine scatter_one_record()
     implicit none
     integer(4) :: k

!$OMP PARALLEL DO
    do k = 1, 5
    end do
!$OMP END PARALLEL DO
    end subroutine

end module foo

program main
    use foo, only: grpbcast_set_c
    implicit none

    character(len=256), target:: run_period
    run_period = "00:00:00:16"

    call grpbcast_set_c(run_period)
end program

Makefile

.PHONY: all

all: minimal

minimal.o: minimal.f90
    ifort -fopenmp -c minimal.f90 -o minimal.o

minimal: minimal.o
    ifort -fopenmp -o minimal -L./ minimal.o

ifort version

> ifort --version
ifort (IFORT) 17.0.1 20161005

compile&run

make
./minimal

..

..

Previous Analysis, only here to understand the comment chain for this question

Loop in question

subroutine grpbcast_set_c(c)
 use nrtype, only : rp => rp
 implicit none
 character(*), intent(in), target :: c
 integer(4) :: cloc

 do cloc = 1, len(c)
  idx_c = idx_c + 1
  if (idx_c > max_ptr_gbcast) stop 9
  ptr_c( idx_c )%c => c( cloc:cloc )
 end do
end subroutine

Specification of ptr_c in module

type pointer_character
   character, pointer :: c
end type

type(pointer_character), private, allocatable, save :: ptr_c(:)

Data as seen in debugger (TotalView)

Data at Runtime

Compiler Command

> mpif90 -O0 -no-ipo -g -convert big_endian -fopenmp -r8 -DUSE_MPI -I/usr/apps.sp3/mpi/openmpi/1.6.5/i2013.1.046/include -I/usr/apps.sp3/mpi/openmpi/1.6.5/i2013.1.046/lib -I [NUSDAS13_PATH]/src -I [NETCDF_PATH]/include -DUSE_MPI -c mpi_comm.f90 -o mpi_comm.o

> mpif90 --version
ifort (IFORT) 14.0.2 20140120
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

Update

I just tried the same with the newest Intel compiler version:

> mpif90 --version
> ifort (IFORT) 17.0.1 20161005

The result is still wrong but different! This time I get all the characters zeroed.

Newest Intel

Newest Intel - Beginning of Array

Update 2

I've been trying to reproduce the issue with a minimal reproducer (see below), so far no luck, i.e. the error does not show up when running it this way. However, one thing is interesting: TotalView shows the first address of the array slightly differently:

with working array pointer

I guess it must be a problem with the input c, even though it shows up correctly in the debugger at the point where grpcast_set_c is entered.

module foo
    implicit none

    type pointer_character
        character, pointer :: c
    end type
    type(pointer_character), allocatable, save :: ptr_c(:)
    integer(4), save :: idx_c
contains

    subroutine parmread_bcast_a( mt, aout, chktag )
        implicit none
        integer(4), intent(in):: mt
        character(*), intent(out):: aout
        character(*), intent(in):: chktag

        call filetool_cardread_a( mt, aout, chktag )
        write(6,'("* ",a28," = ", a)') chktag, trim(aout)

        call set_ptr_c(aout)
        return
    end subroutine parmread_bcast_a

    subroutine filetool_cardread_a( imt, aout, chktag )
        implicit none
        integer(4), intent(in):: imt
        character(*), intent(out):: aout
        character(*), intent(in), optional:: chktag

        character(len=256):: tag
        character(len=256):: val
        character(len=256):: buff
        integer(4):: idxeql
        integer(4) :: is_debug = 0

        cardread: do
            read( imt, "(a)" ) buff
            if( buff(1:1) /= "*" ) then
                idxeql = index(buff, "=")
                tag = adjustl(buff(:idxeql-1))
                if (is_debug == 1) then
                write(6,*) 'buff', trim(buff)
                endif
                if ( present( chktag ) ) then
                if ( trim(chktag) /= trim(tag) ) then
                    write(6,*)"WARNING !!! the record is for ", trim(tag), &
                    & ", while you specified ", trim(chktag)
                    cycle cardread
                endif
                endif
                val = adjustl(buff(idxeql+1:))
                aout = val
                return
            end if
        end do cardread
    end subroutine filetool_cardread_a

    subroutine set_ptr_c(c)
        implicit none
        character(*), intent(in), target :: c
        integer(4) :: cloc

        do cloc = 1, len(c)
            idx_c = idx_c + 1
            if (idx_c > 10000) stop 9
            ptr_c( idx_c )%c => c( cloc:cloc )
        end do
    end subroutine
end module

program main
    use foo, only: ptr_c, idx_c, parmread_bcast_a
    implicit none

    character(len=256), save:: run_period

    allocate(ptr_c( 10000 ))
    idx_c = 0

    open( 100, file='sample.conf', form='formatted')
    call parmread_bcast_a(100, run_period, "run_period")

    print *, "run period", run_period

    print *, ptr_c( 1 )%c
    print *, ptr_c( 2 )%c
    print *, ptr_c( 3 )%c
    print *, ptr_c( 11 )%c
end program

sample.conf

  run_period               =    00:00:00:16
Michel Müller
  • 5,535
  • 3
  • 31
  • 49
  • I am alnost sure more code will be necessary. Note that endianness does not have anything to do with characters. We will need a [mcve] – Vladimir F Героям слава Feb 21 '17 at 06:21
  • Where do you get the value of `c`? Convert_bigendian is about reading from external files and there is no IO in your example. – Vladimir F Героям слава Feb 21 '17 at 06:23
  • It gets read from a configuration file. But if I understand TotalView correctly, when I step into the aforementioned routine the input still looks fine. Plus I can also print it to console and use it on the rank that has read the config file without issues. The problem surfaces with multiple MPI processes where it broadcasts this config data to the other ranks, which is when this array-of-structs-of-pointers `ptr_c` is used. – Michel Müller Feb 21 '17 at 06:37
  • But there is no broadcasting or even any MPI in your code example, we can't reproduce anything. – Vladimir F Героям слава Feb 21 '17 at 06:43
  • That's the thing - it happens *before* there is any MPI code involved if you look at the TotalView screenshots. – Michel Müller Feb 21 '17 at 06:52
  • We *really* need a [mcve]. It may be a compiler bug. I have reported quite a few before, but **always** I had to cerefully examine how can it be reproduced. – Vladimir F Героям слава Feb 21 '17 at 07:01
  • The loop in question has an uninitialised variable `idx_c`. Who knows what could be happening... – RussF Feb 21 '17 at 07:02
  • In your example **Update 2** you are passing a string literal which can be just a temporary to a dummy argument which is `target` and saving the pointer. That cannot work!!! The pointer is valid **only until the procedure exits**. For the code to be valid you must have a character variable which is `target` also in the outer scope. – Vladimir F Героям слава Feb 21 '17 at 07:06
  • Hold on, I'm putting together a better minimal example. – Michel Müller Feb 21 '17 at 07:10
  • So we need a MCVE which clearly shows what `c` is and how the value is read. – Vladimir F Героям слава Feb 21 '17 at 07:10
  • So far no luck, but the minimal example I posted show how the data is set up. @Vladimir F: Earlier you wrote that the string being pointed to needs to be target in the outer scope. In my original code `run_period` is defined in an external module with `save` directive, however no `target`. Could this be the source? – Michel Müller Feb 21 '17 at 07:30
  • 1
    Theoreticslly yes. If it does not have `target` you cannot have pointers to it. It meyblead to unpredictable behaviour when it mostly works, but sometimes... – Vladimir F Героям слава Feb 21 '17 at 07:33
  • You create confusion about your choices of Fortran and mpi implementation. Under Intel mpi, mpif90 would invoke gfortran. With some open source mpi, the effect of mpif90 would depend on you rebuilding and coordinating install path. Big endian option spelling differs between ifort and gfortran. – tim18 Feb 21 '17 at 15:31
  • I'm using OpenMPI together with ifort. mpif90 is just then just a wrapper for ifort, mostly making includes and linking easier. – Michel Müller Feb 21 '17 at 21:03
  • Alright, I was able to reproduce it. It comes down to OpenMP being enabled by the compiler or not, regardless whether it's actually used at runtime. I only have one cluster environment with ifort available, could someone with ifort try and run the reproducer and see whether the same happens? That would be a lot of help. – Michel Müller Feb 22 '17 at 07:16
  • 1
    Looks like a compiler bug which should be reported to Intel. Either a public support forum or your Premier Support. – Vladimir F Героям слава Feb 22 '17 at 07:54
  • Thank you, I just posted it on the public forum and I'm sending it to our administrators for a potential premier support request. https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/713631 – Michel Müller Feb 22 '17 at 08:38

1 Answers1

0

As a workaround, separating the OpenMP code from this kind of datastructure initialization into multiple modules (even in the same file), works. In above case:

module foo1
    implicit none

    type pointer_character
    character, pointer :: c
    end type
    type(pointer_character), allocatable :: ptr_c(:)
    integer(4) :: idx_c

    contains
    subroutine grpbcast_set_c(c)
        implicit none
        character(*), intent(in), target :: c
        integer(4) :: cloc

        allocate(ptr_c( 256 ))
        idx_c = 0

        do cloc = 1, len(c)
            idx_c = idx_c + 1
            ptr_c( idx_c )%c => c( cloc:cloc )
        end do

        print *, "testprint-2", c
        print *, "testprint-3.1", ptr_c( 1 )%c
        print *, "testprint-3.2", ptr_c( 2 )%c
        print *, "testprint-3.3", ptr_c( 3 )%c
        print *, "testprint-3.4", ptr_c( 11 )%c

    end subroutine
end module    

module foo2
    implicit none

    contains
    subroutine scatter_one_record()
     implicit none
     integer(4) :: k

!$OMP PARALLEL DO
    do k = 1, 5
    end do
!$OMP END PARALLEL DO
    end subroutine
end module

program main
    use foo1, only: grpbcast_set_c
    implicit none

    character(len=256), target:: run_period
    run_period = "00:00:00:16"

    call grpbcast_set_c(run_period)
end program
Michel Müller
  • 5,535
  • 3
  • 31
  • 49