
I'm calling (external) subroutine Objee from within another subroutine (FindVee):

    subroutine FindVee(EVone,Vw0,Ve,Fye)
      use nag_library, only: nag_wp
      use My_interface_blocks, only: Objee
      ...
      implicit none
      real(kind=nag_wp) :: apmax, Val
      ...
      call Objee(apmax,Val)
      write(*,*) 'After Objee', apmax, Val
      ...
    end subroutine FindVee

Subroutine Objee is:

    subroutine Objee(ap,V)
      use nag_library, only: nag_wp
      ...
      implicit none
      real(kind=nag_wp), intent(in) :: ap
      real(kind=nag_wp), intent(out):: V
      ...
      V = U(x,sigma) + beta*piy*yhat1(Nav*(Nav+1)/2) + &
        & beta*eta*(1.0e0-piy)*yhat2(Nav*(Nav+1)/2)
      V = - V
      write(*,*) 'Exit Objee', ap, V
    end subroutine Objee

Running the code like this produces the following output on screen:

Exit Objee 0.0000000000000000 9997.5723796583643

Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:
#0 0x7FAA7ADFF7D7
#1 0x7FAA7ADFFDDE
#2 0x7FAA7A533FEF
#3 0x423B29 in findvee_

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7FAA7ADFF7D7
#1 0x7FAA7ADFFDDE
#2 0x7FAA7A533FEF
#3 0x7FAA7A0B9BA0
#4 0x7FAA7A0BAEFD
#5 0x7FAA7ADFF7D7
#6 0x7FAA7ADFFDDE
#7 0x7FAA7A533FEF
#8 0x423B29 in findvee_
Segmentation fault (core dumped)

I'm using gfortran 4.8.1 with the following options: `-fopenmp -fcheck=all -fcheck=bounds -Wall -Wimplicit-interface -Wimplicit-procedure`. The compiler doesn't show any warnings.

After a week of trying all sorts of things and scanning half the internet for a clue about what was happening, I thought I'd print the shape of V in Objee and see what Fortran gave me. Somehow, it turns out that this solves the problem:

    subroutine Objee(ap,V)
      ...
      write(*,*) 'Exit Objee', ap, V, shape(V)
    end subroutine Objee

produces the following on screen:

 Exit Objee   0.0000000000000000        9997.5723796583643     
 After Objee   0.0000000000000000        9997.5723796583643

Magic! Everything works and it seems like everything's right. Could somebody explain to me what's going on here? And also, how I can solve whatever was going on without printing shape(V) on screen with every call to Objee (which will be in the thousands...)

After running valgrind ./programa --leak-check=full, I obtain the following output:

==2784== Invalid write of size 8
==2784==    at 0x423B3F: findvee_ (FindVee.f95:66)
==2784==    by 0x8: ???
==2784==  Address 0x7ffffffffffffda8 is not stack'd, malloc'd or (recently) free'd
==2784== 

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x4E4D7D7
#1  0x4E4DDDE
#2  0x56A3FEF
#3  0x423B3F in findvee_ at FindVee.f95:66
==2784== Invalid read of size 8
==2784==    at 0x5C7FBA0: ??? (in /lib/x86_64-linux-gnu/libgcc_s.so.1)
==2784==    by 0x5C80EFD: _Unwind_Backtrace (in /lib/x86_64-linux-gnu/libgcc_s.so.1)
==2784==    by 0x4E4D7D7: _gfortran_backtrace (in /usr/lib/x86_64-linux-gnu/libgfortran.so.3.0.0)
==2784==    by 0x4E4DDDE: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.3.0.0)
==2784==    by 0x56A3FEF: ??? (in /lib/x86_64-linux-gnu/libc-2.17.so)
==2784==    by 0x423B3E: findvee_ (FindVee.f95:64)
==2784==    by 0x8: ???
==2784==  Address 0x8000000000000008 is not stack'd, malloc'd or (recently) free'd
==2784== 
==2784== 
==2784== Process terminating with default action of signal 11 (SIGSEGV)
==2784==  General Protection Fault
==2784==    at 0x5C7FBA0: ??? (in /lib/x86_64-linux-gnu/libgcc_s.so.1)
==2784==    by 0x5C80EFD: _Unwind_Backtrace (in /lib/x86_64-linux-gnu/libgcc_s.so.1)
==2784==    by 0x4E4D7D7: _gfortran_backtrace (in /usr/lib/x86_64-linux-gnu/libgfortran.so.3.0.0)
==2784==    by 0x4E4DDDE: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.3.0.0)
==2784==    by 0x56A3FEF: ??? (in /lib/x86_64-linux-gnu/libc-2.17.so)
==2784==    by 0x423B3E: findvee_ (FindVee.f95:64)
==2784==    by 0x8: ???
==2784== 
==2784== HEAP SUMMARY:
==2784==     in use at exit: 3,859 bytes in 20 blocks
==2784==   total heap usage: 157 allocs, 137 frees, 300,126 bytes allocated
==2784== 
==2784== LEAK SUMMARY:
==2784==    definitely lost: 58 bytes in 1 blocks
==2784==    indirectly lost: 0 bytes in 0 blocks
==2784==      possibly lost: 0 bytes in 0 blocks
==2784==    still reachable: 3,801 bytes in 19 blocks
==2784==         suppressed: 0 bytes in 0 blocks
==2784== Rerun with --leak-check=full to see details of leaked memory
==2784== 
==2784== For counts of detected and suppressed errors, rerun with: -v
==2784== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 2 from 2)
Segmentation fault (core dumped)

Lines 64 and 66 (to which the output points) are:

64          call Objee(apmax,Val)

66          write(*,*) 'After Objee', apmax, Val

As an inexperienced user, I don't really understand how this helps me in any way, other than pointing to the portion of my code I already suspected was causing the crash. What am I missing here?

Nisse Engström
DrG
  • Is `Objee` inside `Module`? Have you tried to compile with all compiler warnings on? Can you please add to your question full definition of `ap` and `V` (& definition of `nag_wp`)? – Peter Petrik Jul 07 '14 at 17:57
  • Thanks, Peter! I've added the info you're asking for above: yes to both your first two questions; ap and V are real(kind=nag_wp) just like apmax and Val; nag_wp is the data kind I need to use when calling procedures from the NAG library. – DrG Jul 07 '14 at 21:20
  • Are you using OpenMP? Does the problem go away if you remove option `-fopenmp`? – M. S. B. Jul 07 '14 at 21:29
  • I just tried running it without OpenMP and the issue persists. – DrG Jul 07 '14 at 23:37
  • Do you use a [debugger](http://stackoverflow.com/q/3676322/2838364)? The backtrace looks like it was produced without debugging symbols. Are you on Linux? In that case, use `valgrind`. – Peter Petrik Jul 08 '14 at 03:45
  • I've used valgrind and included the output above, but I'm not sure I know how to extract any relevant information from it. – DrG Jul 09 '14 at 00:49

1 Answer

Memory errors like this in Fortran have two common causes: (1) an illegal subscript access, and (2) a mismatch between the actual arguments in a procedure call and the dummy arguments of the subroutine. Modern compilers and Fortran 90 or later help the programmer find these problems. As Peter suggested, are you using the full warning and error options of your compiler, especially run-time subscript checking? (What compiler are you using?) If you place your procedures in a module and use that module, Fortran will check for consistency between the arguments of the call and those of the subroutine. When a procedure is in a module, its interface is "known" to other procedures or the main program that uses that module, which enables this checking. These two methods will find many errors that cause memory problems.
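As a minimal sketch of the module approach (not the asker's actual code), the example below defines a stand-in for `Objee` inside a module, so the compiler generates the interface from the procedure itself. The names `objective_mod`, `objee_sketch`, and `wp` are hypothetical, `wp` is assumed to be a double-precision kind standing in for `nag_wp`, and the body of the subroutine is a placeholder rather than the original formula. Note that a hand-written interface block for an external procedure is generally not verified against that procedure's definition, so it can silently disagree with it; a module procedure cannot.

```fortran
module objective_mod
  implicit none
  private
  public :: wp, objee_sketch

  ! Assumption: a double-precision kind standing in for nag_wp.
  integer, parameter :: wp = kind(1.0d0)

contains

  ! Hypothetical stand-in for Objee: scalar in, scalar out.
  subroutine objee_sketch(ap, v)
    real(wp), intent(in)  :: ap
    real(wp), intent(out) :: v

    ! Placeholder objective, not the formula from the question.
    v = -(ap*ap + 1.0_wp)
  end subroutine objee_sketch

end module objective_mod

program demo
  use objective_mod, only: wp, objee_sketch
  implicit none
  real(wp) :: apmax, val

  apmax = 0.0_wp
  call objee_sketch(apmax, val)
  write(*,*) 'After objee_sketch', apmax, val

  ! A call with the wrong kind or rank, for example
  !   call objee_sketch(0.0, val)
  ! is rejected at compile time because the interface is explicit.
end program demo
```

Building this with something like `gfortran -g -fbacktrace -fcheck=all` keeps run-time subscript checking on and gives backtraces with file names and line numbers.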

The reason that adding "random" statements such as output can stop memory errors is that the extra code changes the memory layout, so the illegal memory access may now do damage that can be tolerated, whereas before it was doing fatal damage, such as overwriting an address with a data value and so creating an illegal address. These bugs can be difficult to diagnose because the fatal error seems disconnected from the coding mistake. The tools described in the first paragraph can be a big help.

M. S. B.
  • Thanks for your answer! I've edited the question to address your points. ap and V are real(kind=nag_wp), rank 0 like apmax and Val, so I don't think either 1) or 2) applies. I'm also using all the warnings gfortran provides, but the compiler doesn't seem to find any errors or warnings to show. Finally, I have included Objee in an interface block in a module that I'm using in FindVee, but the issue persists. – DrG Jul 07 '14 at 21:26