My main FORTRAN MPI code reaches a point where all processes call a script. The code looks something like:
write(syscommand,'(a13,1x,i3)') './vscript.csh', my_mpi_proc_num
rc=system(syscommand)
Now, this section of code loops through more than a hundred times, and the script runs fine on all processes. Then, randomly as far as I can tell, some process will enter system() and return an error code of 32512. A few other things then happen (sorry, I can't show much more code; my employer would not be too happy), then MPI_ABORT is called and all the processes die. I am told that 32512 is often the error code returned when a command cannot be found. That seems unlikely here because, as I have indicated, the script is found hundreds of times before this crash, and nothing is moving it around.
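For reference, 32512 is 127 * 256. On Linux and most Unix systems that is how a wait status encodes a normal exit with code 127, which the shell uses for "command not found"; the C library system() also reports 127 when the shell itself could not be executed in the child process. Below is a minimal sketch of the decoding, assuming the Fortran system function hands back the raw wait status unchanged (an assumption on my part; it is a compiler extension and may behave differently elsewhere). The names wait_status, exit_code_of, and signal_of are made up for illustration.

```fortran
module wait_status
  implicit none
contains
  ! Exit code of the child: bits 8-15 of a conventional Unix wait status.
  ! Assumption: the value passed in is the raw status from system().
  pure integer function exit_code_of(status) result(code)
    integer, intent(in) :: status
    code = iand(ishft(status, -8), 255)
  end function exit_code_of

  ! Terminating signal number: low 7 bits (0 means a normal exit).
  pure integer function signal_of(status) result(sig)
    integer, intent(in) :: status
    sig = iand(status, 127)
  end function signal_of
end module wait_status

program decode_rc
  use wait_status
  implicit none
  integer :: rc
  rc = 32512   ! the value reported in the question
  ! Prints: exit code = 127, signal = 0
  print '(a,i0,a,i0)', 'exit code = ', exit_code_of(rc), ', signal = ', signal_of(rc)
end program decode_rc
```

Logging both numbers each time system() returns nonzero would at least show whether every failure is the same exit code 127 or whether something else is mixed in.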
I seem to have found a stopgap measure:
write(syscommand,'(a13,1x,i3)') './vscript.csh', my_mpi_proc_num
rc=32512
num_attempts=0
do while (num_attempts<100 .and. rc==32512)
  num_attempts=num_attempts+1
  rc=system(syscommand)
enddo
That is, each process will try up to 100 times to get past the 32512 error. I am sure this is horrible code, but it is working.
So, does anyone have a clue why I am getting this error? A thought: if two processes try to run the same script nearly simultaneously, will one of them be kicked out and forced to return 32512? Thanks.