
My main FORTRAN MPI code reaches a point where all processes call a script. The code looks something like

write(syscommand,'(a13,1x,i3)') './vscript.csh', my_mpi_proc_num
rc=system(syscommand)

Now, this section of code loops through over a hundred times, and the script runs fine on all processes. Then, randomly as far as I can tell, some process will enter system and return an error code of 32512. A few other things then happen (sorry I can't show much more code. My employer would not be too happy.), then an MPI_ABORT is called and all the processes die. I am told that 32512 is often the error code returned when a command cannot be found. This is unlikely because, as I have indicated, the script is found hundreds of times before this crash, and nothing is moving it around.

I seem to have found a stop gap measure:

write(syscommand,'(a13,1x,i3)') './vscript.csh', my_mpi_proc_num
rc=32512
num_attempts=0
do while (num_attempts<100 .and. rc==32512)
  num_attempts=num_attempts+1
  rc=system(syscommand)
enddo

i.e. each process will try 100 times to get past the 32512 thing. Although I am sure this is horrible code, it is working.

So, anyone have a clue why I am getting this error? A thought: If two processes try to run the same script near simultaneously, will one of them be kicked out and forced to return that 32512? Thanks.

bob.sacamento
    I don't know what this error code means in your Fortran runtime but if you happen to use InfiniBand for interconnect between the MPI processes, then you should know that calling `fork(2)` (`system(3)` calls it) is dangerous and not supported in most MPI implementations on Linux - the child process might segfault. – Hristo Iliev May 17 '12 at 07:52

1 Answer


Probably your compiler implements the system intrinsic as a call to the POSIX system(3) function provided by the system library.

This call returns an integer number that is organized as follows.

 bits 0-6 -- number of the signal that terminated the child (0 if it exited normally; all bits set means the process was stopped)
 bit 7 -- core dump flag
 bits 8-15 -- exit status of the child process

The last line is the important one.
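
The layout above can be unpacked with a few shifts and masks. This is only a rough sketch in POSIX sh: `decode_status` is a made-up helper name, and it assumes the traditional Linux/UNIX status encoding described here.

```shell
#!/bin/sh
# decode_status (hypothetical helper): split a system(3)-style return
# value into the three fields listed above.
decode_status() {
    rc=$1
    sig=$(( $rc & 127 ))           # bits 0-6: terminating signal
    core=$(( ($rc >> 7) & 1 ))     # bit 7: core dump flag
    code=$(( ($rc >> 8) & 255 ))   # bits 8-15: exit status of the child
    echo "signal=$sig core=$core exit=$code"
}

# The value reported in the question.
decode_status 32512
```

For 32512 the signal and core fields are 0 and the exit field is 127, so the child shell exited on its own rather than being killed.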

The return code 32512 is 0x7F00, i.e. the exit status of the child shell is 127. In the Bourne shell and other UNIX shells this means the command was not found in PATH and is not a built-in shell command (see this question). It is also known as the "command not found" error.
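
If you want to see the 127 status for yourself, something like the following should reproduce it. It is a sketch only: `child_status` is a made-up helper, `./no_such_script.csh` is a placeholder name assumed not to exist, and running the command through `sh -c` is meant to approximate what system(3) does.

```shell
#!/bin/sh
# child_status (hypothetical helper): run a command line in a child
# shell, roughly as system(3) does with "sh -c", and print its exit status.
child_status() {
    sh -c "$1" >/dev/null 2>&1
    echo $?
}

# ./no_such_script.csh is a placeholder that is assumed not to exist.
status=$(child_status './no_such_script.csh 7')
echo "child exit status: $status"
echo "value system() would report: $(( status * 256 ))"
```

When the file is missing, the child status is 127, and 127 * 256 gives the 32512 seen in the question.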

If anything could be tampering with your PATH variable, that could be the cause. You could try replacing ./vscript.csh and all the commands within it with their absolute paths.
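
One way to collect those absolute paths is to ask the shell where it currently finds each command. This is a sketch: `resolve_cmds` is a made-up helper, and `csh`, `awk` and `sed` are placeholders for whatever vscript.csh really calls.

```shell
#!/bin/sh
# resolve_cmds (hypothetical helper): print where each command name
# resolves in the current PATH, or a note if the lookup fails.
# For shell built-ins, "command -v" prints just the name, not a path.
resolve_cmds() {
    for cmd in "$@"; do
        if path=$(command -v "$cmd"); then
            echo "$cmd -> $path"
        else
            echo "$cmd -> not found in PATH"
        fi
    done
}

# Placeholder command names; replace with the ones the script uses.
resolve_cmds csh awk sed
```

Running this in the same environment as the MPI job (not just your login shell) shows whether PATH there differs from what you expect.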

On some MPI implementations spawning processes from MPI processes is not supported. We have seen issues with some versions of OpenMPI. If you call fork() or system() from an OpenMPI program you will get a warning:

An MPI process has executed an operation involving a call to the "fork()" system call to create a child process. Open MPI is currently operating in a condition that could result in memory corruption or other system errors; your MPI job may hang, crash, or produce silent data corruption. The use of fork() (or system() or other calls that create child processes) is strongly discouraged.

If you are absolutely sure that your application will successfully and correctly survive a call to fork(), you may disable this warning by setting the mpi_warn_on_fork MCA parameter to 0.

On the other hand, a recent version of the OpenMPI FAQ claims that

In general, if your application calls system() or popen(), it will likely be safe.

This limitation is not specific to OpenMPI and affects every implementation that relies on the OpenFabrics stack.

Dima Chubarov
  • Dmitri and Hristo, Thanks. Did not know about this. Can't think of another option besides calling this script. If I moved all the code of the script into the FORTRAN it would be a mess, and would involve a system call somewhere anyway. If anyone has any better ideas, I would be grateful to hear them. Will try the idea with the absolute path. Thanks again! – bob.sacamento May 17 '12 at 15:38