
After I recompile my (C) program, some nodes are running old compiles (with the debug information still in it), and some nodes are running the new copy. The server is running Gentoo Linux and all nodes get the file from the same storage. I'm told the filesystem is NFS. The MPI I'm using is MPICH Version 1.2.7. Why are some nodes not using the newly compiled copy?

Some more details (in case you're having trouble sleeping):

I'm trying to create my first MPI program (and I'm new to C and Linux, too). I have the following in my code:

/* Needs <stdio.h>, <unistd.h> and <sys/types.h>. */
#if DEBUG
  {
    int i = 9;
    pid_t pid = getpid();
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("PID %d on %s ready for attach.\n", (int)pid, hostname);
    fflush(stdout);
    /* Wait 9 x 5 seconds so that gdb can be attached to this process. */
    while (i > 0) {
      printf("PID %d on %s will wait for `gdb` to attach for %d more iterations.\n", (int)pid, hostname, i);
      fflush(stdout);
      sleep(5);
      i--;
    }
  }
#endif

Then I recompiled without the -DDEBUG=1 option, so the code above is excluded:

$ mpicc -Wall -I<directories...> -c myprogram.c
$ mpicc -o myprogram myprogram.o -Wall <some other options...> 

The program compiles with no problems. Then I execute it like this:

$ mpirun -np 3 myprogram

Sometimes (and more and more frequently), different copies of the executable run on different nodes of the cluster. On some nodes the debugging code executes (and prints), and on others it doesn't.
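One way to tell which build a node sees without running it is to look for the debug message inside the binary, since that format string is only compiled in when `DEBUG` is set. A minimal sketch, assuming a POSIX shell; the function name `has_debug_code` is made up for illustration:

```shell
# Hypothetical helper: succeed if the file still contains the debug message
# from the #if DEBUG block (the format string is stored in the binary as text).
has_debug_code() {
    grep -q 'ready for attach' "$1"
}

# Example: run this on each node against the same path.
# if has_debug_code ./myprogram; then echo "debug build"; else echo "non-debug build"; fi
```

If two nodes disagree about the same path, they are not reading the same bytes, which points at caching rather than at the build itself.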

Note that the cluster is currently experiencing some "clock skew" (or something like that), which may be the cause. Is that the problem?

Also note that I actually just change the compile options by commenting/uncommenting lines in a Makefile because I haven't had time to implement these suggestions yet.

Edit: When the problem occurs, md5sum myprogram returns a different value on the nodes where the issue presents itself.
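The `md5sum` comparison can be automated so that a mismatch is obvious at a glance. This is only a sketch: `checksums_agree` is a made-up name, the host names and path in the example are placeholders, and it assumes the nodes are reachable over `ssh`.

```shell
# Hypothetical helper: read "host checksum" lines on stdin, print each
# distinct checksum with the hosts that reported it, and return non-zero
# if more than one distinct checksum was seen.
checksums_agree() {
    awk '
        { hosts[$2] = hosts[$2] " " $1 }
        END {
            n = 0
            for (sum in hosts) { n++; print sum ":" hosts[sum] }
            exit (n > 1)
        }'
}

# Example (node1..node3 are placeholder host names; assumes ssh access):
# for h in node1 node2 node3; do
#     printf '%s %s\n' "$h" "$(ssh "$h" md5sum /path/to/myprogram | cut -d" " -f1)"
# done | checksums_agree
```

Running it right after a build, before `mpirun`, shows which nodes are still serving the old file.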

Jeff
  • The problem is because of clock skew. You should run NTP to synchronize the clocks on the cluster. Basically, the clock on some of the machines is in the future as compared to the build machine (or NFS server), so they think their copy of the file is newer. If you can't set up NTP on the cluster, you can delete the executable before every build as a workaround. – Greg Inozemtsev Aug 29 '12 at 01:42
  • OK. The big cheese is trying to figure out the clock skew now. But note that I did try deleting the executable (using `make clean`) and then recompiling, and the problem recurred anyway. What do you mean by "they think their copy of the file is newer"? Aren't they all accessing the same copy, the one on the disk drive? Do you mean a cached copy of the program, or something similar? – Jeff Aug 29 '12 at 03:20
  • Sorry, the deleting idea won't work, for the same reason: caching. NFS caches file attributes and positive/negative lookups (file found/not found, which is why deletion did not work). This caching works together with the normal buffer cache - the cached buffers will only be invalidated when the attributes are. Now that I think about it, I'm not actually sure if clock skew affects NFS caching. It will definitely mess with `make` though. Try mounting your NFS share with the `noac` option to disable attribute caching. – Greg Inozemtsev Aug 29 '12 at 05:36
  • We fixed the clock skew, but the problem persists. Now the `md5sum` never gets updated, even when the new compiled version is the one executing. I guess I will try to talk the owner into mounting with `noac`. – Jeff Sep 05 '12 at 06:54
  • How much of a performance hit would `noac` cost us? What would it influence? Thanks. – Jeff Sep 07 '12 at 22:17
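On the `noac` question: as far as the `nfs(5)` man page describes it, `noac` makes the client revalidate file attributes with the server on every access and also makes application writes synchronous, so the cost is extra round trips on every file operation on that mount. Before changing anything it may help to see how the share is mounted now. The sketch below assumes a Linux client where `/proc/mounts` lists the active options; `nfs_mount_report` is a made-up name.

```shell
# Hypothetical helper: list NFS mounts and whether attribute caching is disabled.
# $1 is a mounts table in /proc/mounts format (defaults to /proc/mounts).
nfs_mount_report() {
    awk '$3 ~ /^nfs4?$/ {
        # Wrap the option list in commas so "noac" only matches as a whole option.
        state = ("," $4 "," ~ /,noac,/) ? "noac" : "attribute-cached"
        print $2, state
    }' "${1:-/proc/mounts}"
}

# Example: nfs_mount_report
```

If the mount holding the executable is reported as attribute-cached, adding `noac` to the options field of its `/etc/fstab` entry on each node is the change being discussed here.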

1 Answer


Some of your nodes have kept a cached copy of the file and run that instead of the latest build. This has little to nothing to do with Gentoo; it is an artifact of the Linux kernel's caching and/or the NFS client implementation.

In other words, your binary is cached. Read this answer:

NFS cache-cleaning command?

Tweaking some settings may also help.


I happen to have a script here that syncs and then drops the page, dentry, and inode caches on the machine where it runs:

$ cat /home/jaroslav/bin/flush_cache 
sudo sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
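
Because `drop_caches` only affects the kernel of the machine it runs on, the flush has to happen on every compute node, not just on the build host. A minimal sketch of doing that in one go; `flush_nodes` and the `RUNNER` variable are made up for illustration, and it assumes `ssh` access and `sudo` rights that work without a password prompt on each node:

```shell
# Hypothetical wrapper: run the sync-and-drop-caches step on every host given.
# RUNNER defaults to ssh; set RUNNER=echo to print the commands instead.
flush_nodes() {
    for host in "$@"; do
        ${RUNNER:-ssh} "$host" "sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'" || return 1
    done
}

# Example (placeholder host names): flush_nodes node1 node2 node3
```

Running it between the build and `mpirun` may help, but it treats the symptom; the mount options are where the stale copies come from.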