After I recompile my C program, some nodes run the old build (with the debugging code still in it) while others run the new one. The server is running Gentoo Linux, and all nodes get the file from the same storage; I'm told the filesystem is NFS. The MPI I'm using is MPICH version 1.2.7. Why are some nodes not using the newly compiled copy?
Some more details (in case you're having trouble sleeping):
I'm trying to create my first MPI program (and I'm new to C and Linux, too). I have the following in my code:
    #if DEBUG
    {
        int i=9;
        pid_t PID;
        char hostname[256];
        gethostname(hostname, sizeof(hostname));
        printf("PID %d on %s ready for attach.\n", PID=getpid(), hostname);
        fflush(stdout);
        while (i>0) {
            printf("PID %d on %s will wait for `gdb` to attach for %d more iterations.\n", PID, hostname, i);
            fflush(stdout);
            sleep(5);
            i--;
        }
    }
    #endif
Then I recompiled without the -DDEBUG=1 option, so the code above is excluded:
$ mpicc -Wall -I<directories...> -c myprogram.c
$ mpicc -o myprogram myprogram.o -Wall <some other options...>
The program compiles with no problems. Then I execute it like this:
$ mpirun -np 3 myprogram
Sometimes (and more and more frequently), different copies of the executable run on different nodes of the cluster. On some nodes the debugging code executes (and prints), and on others it doesn't.
Note that the cluster is currently experiencing some "clock skew" (or something like that), which may be the cause. Is that the problem?
Also note that I actually just change the compile options by commenting/uncommenting lines in a Makefile because I haven't had time to implement these suggestions yet.
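Since the build options are toggled by editing the Makefile, one way to see which build each node is really running would be to compile a stamp into the executable and print it at startup. The following is only a sketch: `format_build_stamp` is a hypothetical helper name, and the stamp relies on the standard `__DATE__` and `__TIME__` macros, so it reflects the clock of the machine that compiled the file.

```c
#include <stdio.h>

#ifndef DEBUG
#define DEBUG 0   /* mirrors how "#if DEBUG" treats an undefined macro */
#endif

/* Hypothetical helper: writes "built <date> <time> DEBUG=<n>" into out.
 * Returns 0 on success, -1 if the buffer is too small or formatting fails. */
int format_build_stamp(char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "built %s %s DEBUG=%d",
                     __DATE__, __TIME__, DEBUG);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Printing this string next to the hostname from every process would show whether the processes that run the debugging code also report an older compile time.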
Edit: When the problem occurs, md5sum myprogram returns a different value on the nodes where the issue presents itself.
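As a complement to the md5sum check, the size and modification time that each node reports for the file can be compared as well. This is a minimal sketch, assuming POSIX `stat` and `gethostname` are available; `describe_binary` is a hypothetical helper, and whether any difference it shows comes from the filesystem or from the clock skew is not something this code can establish.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: writes "host=<name> size=<bytes> mtime=<epoch>" for
 * the file at path into out, as seen from the node it runs on.
 * Returns 0 on success, -1 on any failure (including a too-small buffer). */
int describe_binary(const char *path, char *out, size_t outlen)
{
    struct stat st;
    char hostname[256];
    int n;

    if (stat(path, &st) != 0)
        return -1;
    if (gethostname(hostname, sizeof(hostname)) != 0)
        return -1;
    hostname[sizeof(hostname) - 1] = '\0';  /* ensure termination */

    n = snprintf(out, outlen, "host=%s size=%lld mtime=%lld",
                 hostname, (long long)st.st_size, (long long)st.st_mtime);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Calling this with the path of the executable on each node and printing the result gives one line per host that can be compared side by side.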