7

Suppose I want to get several of a file's properties (owner, size, permissions, times) as returned by the lstat() system call. One way to do this in Java is to create a java.io.File object and do calls like length(), lastModified(), etc. on it. I have two problems so far:

  1. Each one of these calls triggers a stat() call, and for my purposes stat()s are considered expensive: I'm trying to scan billions of files in parallel on hundreds of hosts, and (to a first approximation) the only way to access these files is via NFS, often against filer clusters where stat() under load may take half a second.

  2. The call isn't lstat(), it's typically stat() (which follows symlinks) or fstat64() (which opens the file and may trigger a write operation to record the access time).

Is there a "right" way to do this, such that I end up just doing a single lstat() call and accessing the members of the struct stat? What I have found so far from Googling:

  • JDK 7 will have the PosixFileAttributes interface in java.nio.file with everything I want (but I'd rather not be running nightly builds of my JDK if I can avoid it).

  • I can roll my own interface with JNI or JNA (but I'd rather not if there's an existing one).

A previous similar question got a couple of suggested JNI/JNA implementations. One is gone and the other is questionably maintained (e.g., no downloads, just an hg repository).

Are there any better options out there?

Aaron D. Ball

3 Answers

3

Looks like you've pretty much covered all the bases. When I started reading your question my first thought was JDK 7 or JNI. Without knowing anything about the change pattern on these files you might also look into some sort of persistent cache of the information in question, like an embedded DB. You could also look at some other access method besides NFS, like a custom web service that provides bulk file information from a remote host.
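
For reference, the JDK 7 route could look roughly like the sketch below. It assumes a POSIX file system (elsewhere `Files.readAttributes` throws `UnsupportedOperationException` for this attribute type), and the class name `LstatSketch` and its `lstat` helper are made up for illustration. `LinkOption.NOFOLLOW_LINKS` asks for the attributes of the link itself rather than its target; on Linux builds of OpenJDK this is expected to map to one lstat()-style call per file, but that is an implementation detail worth confirming with strace.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFileAttributes;
import java.nio.file.attribute.PosixFilePermissions;

final class LstatSketch {

    // Hypothetical helper: fetches all POSIX attributes in one bulk read,
    // without following a symlink at the final path component.
    static PosixFileAttributes lstat(Path path) {
        try {
            return Files.readAttributes(
                    path, PosixFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path path = Paths.get(args.length > 0 ? args[0] : ".");
        PosixFileAttributes attrs = lstat(path);
        // Every accessor below reads from the same attributes snapshot.
        System.out.println("owner: " + attrs.owner().getName());
        System.out.println("group: " + attrs.group().getName());
        System.out.println("size:  " + attrs.size());
        System.out.println("perms: " + PosixFilePermissions.toString(attrs.permissions()));
        System.out.println("mtime: " + attrs.lastModifiedTime());
        System.out.println("atime: " + attrs.lastAccessTime());
        System.out.println("link:  " + attrs.isSymbolicLink());
    }
}
```

The accessors on the returned object do not go back to the file system, which is the main difference from calling length(), lastModified(), etc. on a java.io.File one at a time.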

Jherico
  • Thanks! Ultimately I guess JDK 7 isn't so bad; I can just keep the binaries with the tool I'm writing, and it will be production-grade software soon enough. – Aaron D. Ball Dec 16 '09 at 13:34
1

Yes, stat() underlies all of these calls and libraries, so this is fundamentally a latency problem. You can, however, issue many stat() calls at once from multiple threads (unless someone has an asynchronous stat() up their sleeve!), since the NFS server runs many daemons to service your connections. If you could get onto the host itself, say over ssh, stat() would be much cheaper; you could even write a TCP service that streams paths in and streams stat() results out. Unfortunately, access to the NFS server is often hard or impossible: it may only have admin accounts, or it may be a Hitachi SAN or something similar.
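
A minimal sketch of the threaded approach, assuming JDK 7+ and a POSIX file system. The class name `ParallelStatSketch`, the `statAll` helper, and the thread count are all illustrative choices, not tuned values. Each task performs one blocking attribute read, so the pool size bounds how many requests are in flight against the server at once.

```java
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelStatSketch {

    // Hypothetical helper: reads attributes for every path using a fixed pool.
    // Paths that cannot be read (missing, permission denied) map to null.
    static Map<Path, PosixFileAttributes> statAll(List<Path> paths, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<PosixFileAttributes>> pending = new ArrayList<>();
            for (final Path p : paths) {
                Callable<PosixFileAttributes> task = () -> Files.readAttributes(
                        p, PosixFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
                pending.add(pool.submit(task));
            }
            Map<Path, PosixFileAttributes> results = new LinkedHashMap<>();
            for (int i = 0; i < paths.size(); i++) {
                try {
                    results.put(paths.get(i), pending.get(i).get());
                } catch (ExecutionException e) {
                    results.put(paths.get(i), null); // the read itself failed
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop waiting, keep partial results
                    break;
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A call such as `ParallelStatSketch.statAll(paths, 64)` keeps up to 64 reads outstanding. Whether that helps depends on where the latency comes from: if the server serializes stat calls internally, more client threads may only add contention.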

DavidP
    For a little historical background: the NFS servers in question were a 5-10PB Isilon cluster, which provided strict consistency on stat calls but at the cost of terrible latency under contention. (I'm still not sure if they had a big lock or something more sophisticated.) This was a filesystem-level problem: we didn't do any better sshed in as root. We ended up just letting it take its time rather than spend a few days of people-time trying to save a few days of computer-time. – Aaron D. Ball Dec 30 '16 at 03:53
0

Each one of these calls triggers a stat() call, and for my purposes stat()s are considered expensive: I'm trying to scan billions of files in parallel on hundreds of hosts, and (to a first approximation) the only way to access these files is via NFS, often against filer clusters where stat() under load may take half a second.

There's not much to do here: the interface Linux provides accepts only one file per call.

Is there a "right" way to do this, such that I end up just doing a single lstat() call and accessing the members of the struct stat? .... got a couple of suggested JNI/JNA implementations. One is gone and the other is questionably maintained (e.g., no downloads, just an hg repository).

Calling a C function using JNA is very straightforward, so a wrapper library may not be necessary. Below is a snippet that makes a call to the stat or lstat syscall; see my complete answer for more details:

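import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Platform;

// Goes through syscall() instead of calling the lstat C function directly,
// because stat and lstat aren't exported as symbols by libc 2.31 and earlier.
// Stat is a JNA Structure mirroring struct stat; it is not defined in this snippet.
public interface Stats extends Library {

  // On JNA 5+, Native.load is the non-deprecated equivalent of Native.loadLibrary.
  Stats INSTANCE = Native.loadLibrary(Platform.C_LIBRARY_NAME, Stats.class);

  int syscall(int number, Object... args);

  // 4 is SYS_stat on x86-64 Linux; syscall numbers differ on other architectures.
  default int doStat(String pathname, Stat statbuf){
    return this.syscall(4, pathname, statbuf);
  }

  // 6 is SYS_lstat on x86-64 Linux.
  default int doLstat(String pathname, Stat statbuf){
    return this.syscall(6, pathname, statbuf);
  }
}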
deFreitas