
I am searching for an efficient way to iterate over thousands of files in one or more directories.

The only way to iterate over files in a directory seems to be the File.list*() methods. These methods effectively load the entire list of file names into some sort of collection and then let the user iterate over it, which seems impractical in terms of time and memory consumption. I tried looking at commons-io and other similar tools, but they all ultimately call File.list*() somewhere inside. JDK 7's walkFileTree() came close, but it doesn't give me control over when to pick the next element.

I have over 150,000 files in a directory, and after many -Xms/-Xmx trial runs I got rid of the out-of-memory issues. But the time it takes to fill the array hasn't changed.

I wish to make some sort of an Iterable class that uses opendir()/closedir() like functions to lazily load file names as required. Is there a way to do this?

Update:

Java 7 NIO.2 supports lazy file iteration via java.nio.file.DirectoryStream, which is an Iterable. For JDK 6 and below, the only option is the File.list*() methods.
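A minimal sketch of the DirectoryStream approach (the class name and directory argument are illustrative):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LazyList {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // DirectoryStream reads entries lazily from the OS, one at a time,
        // instead of materializing the whole listing in memory first.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                System.out.println(entry.getFileName());
            }
        }
    }
}
```

The try-with-resources block is important: the stream holds an open directory handle (much like opendir()) until it is closed.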

Oleksi
Unmanned Player
  • I don't know whether a standard solution exists for that. I guess there is no other way than to implement it yourself in C and access it via JNI... – Sergey Savenko Mar 28 '12 at 04:23
  • The answers in this question might help - http://stackoverflow.com/questions/1034977/how-to-retrieve-a-list-of-directories-quickly-in-java – charlemagne Mar 28 '12 at 04:24
  • I suspect that the real problem here is that you have a single directory with 150K files. I certainly wouldn't want to stress test a file system in that way. Can you not use subdirectories, perhaps grouping files by the first two characters in the filename or something? – Dilum Ranatunga Mar 28 '12 at 05:12
  • 1
    @DilumRanatunga: Years of experience has taught me that its more cost effective to fix code than ask users to change their way of working :) – Unmanned Player Mar 28 '12 at 05:26
  • Do the file names have a common pattern between them? – adranale Mar 28 '12 at 08:21

4 Answers


Here is an example of how to iterate over directory entries without having to store all 150,000 of them in an array. Add error/exception/shutdown/timeout handling as necessary. This technique uses a secondary thread to load a small blocking queue.

Usage is:

FileWalker z = new FileWalker(new File("\\"), 1024); // start path, queue size
Iterator<Path> i = z.iterator();
while (i.hasNext()) {
  Path p = i.next();
}

The example:

import java.io.File;
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.FileVisitor;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class FileWalker implements Iterator<Path> {
  final BlockingQueue<Path> bq;

  FileWalker(final File fileStart, final int size) throws Exception {
    bq = new ArrayBlockingQueue<Path>(size);
    // Producer: walks the tree on a daemon thread. offer() blocks whenever
    // the queue is full, so at most `size` paths are held in memory.
    Thread thread = new Thread(new Runnable() {
      public void run() {
        try {
          Files.walkFileTree(fileStart.toPath(), new FileVisitor<Path>() {
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
              return FileVisitResult.CONTINUE;
            }
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
              try {
                bq.offer(file, 4242, TimeUnit.HOURS);
              } catch (InterruptedException e) {
                e.printStackTrace();
              }
              return FileVisitResult.CONTINUE;
            }
            public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
              return FileVisitResult.CONTINUE;
            }
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
              return FileVisitResult.CONTINUE;
            }
          });
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    });
    thread.setDaemon(true);
    thread.start();
    thread.join(200); // give the walker a head start before consuming
  }

  public Iterator<Path> iterator() {
    return this;
  }

  // Polls briefly; returns false if no entry shows up within 2 seconds.
  public boolean hasNext() {
    boolean hasNext = false;
    long dropDeadMS = System.currentTimeMillis() + 2000;
    while (System.currentTimeMillis() < dropDeadMS) {
      if (bq.peek() != null) {
        hasNext = true;
        break;
      }
      try {
        Thread.sleep(1);
      } catch (InterruptedException e) {
        e.printStackTrace();
      }
    }
    return hasNext;
  }

  public Path next() {
    Path path = null;
    try {
      path = bq.take();
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
    return path;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}
Java42
  • Thanks! The extra thread part bugs me a little, but I'll figure out a way to push this Runnable onto some drone thread. – Unmanned Player Mar 28 '12 at 08:35
  • @Eshan - Minor price to pay, since it dies. But keep in mind it will stay alive if your while (hasNext()) terminates early. You need to add some failsafe code, as you noticed. But this technique keeps memory usage very low. – Java42 Mar 28 '12 at 13:30

This seems to be impractical in terms of time/memory consumption.

Even 150,000 files won't consume an impractical amount of memory.

I wish to make some sort of an Iterable class that uses opendir()/closedir() like functions to lazily load file names as required. Is there a way to do this?

You would need to write or find a native code library in order to access those C functions. It is probably going to introduce more problems than it solves. My advice would be to just use File.list() and increase the heap size.


Actually, there's another hacky alternative: use Runtime.exec to run the ls command (or the Windows equivalent) and write your iterator to read and parse the command's output text. That avoids the nastiness associated with using native libraries from Java.
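A minimal sketch of that exec-based idea (Unix-only; it assumes ls is on the PATH, and the class name is illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class LsReader {
    public static void main(String[] args) throws IOException {
        String dir = args.length > 0 ? args[0] : ".";
        // Run "ls -1" and consume its output line by line; entries are
        // processed as they arrive instead of being collected up front.
        Process p = new ProcessBuilder("ls", "-1", dir).start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String name;
            while ((name = r.readLine()) != null) {
                System.out.println(name); // handle each entry here
            }
        }
    }
}
```

Note this inherits ls's quirks (locale-dependent sorting, odd behavior with newlines in file names), which is part of why it's hacky.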

Stephen C
  • The software, when designed 15 years ago, made the mistake of forking threads to do things the then-designer perceived as "parallel". The current version forks 100+ threads and uses 1.5 GiB+ of memory just to become operational on JDK 6. The directory listing simply adds more to that. That's what I meant by impractical. JNI/System.exec() is not an option here. – Unmanned Player Mar 28 '12 at 05:02
  • *"JNI/System.exec() is not an option here"*. Then you are out of options using Java 6. Sorry. – Stephen C Mar 28 '12 at 06:35
  • 1
    *"The software when designed 15 years ago made a mistake of forking threads to do things what the then designer percived it as "parallel"."*. It sounds like you need to fix THAT problem first. In fact, given that JNI and exec are not options, you probably have no choice. But the good news is that you can probably replace rampant thread forking by restructuring to use a bounded thread-pool executor service and thereby eliminate the memory overhead of 90+ thread stacks, etc. – Stephen C Mar 28 '12 at 06:41
  • (Or Plan C ... push back on the users: "If you want to run this application on ridiculously large directories, buy a 64-bit machine with lots of memory and run the application in a 64-bit JVM. Otherwise, it crashes. Sorry.") – Stephen C Mar 28 '12 at 06:48
  • We've been doing that for quite some time now. I felt a bit bad, so I took the initiative of cleaning up the code and started pushing management to allow me (and a couple of others) to rewrite some major parts of the performance-offending code. – Unmanned Player Mar 28 '12 at 09:16

Can you group your loads by file types to narrow down the batches?

TGH
  • Splitting files into groups across directories sounds good. I tried this on one of my users' sites, and it turned out they had filled thousands of files into two directories, one 'a-z' and the other '0-9'. Like I said in another comment, it's easier to fix code than to ask users to change the way they work :) – Unmanned Player Mar 28 '12 at 05:33

I was just wondering why the plain File.list() method, which returns a String[] of file names (not File.listFiles()), would consume a lot of memory. It's a native call that just returns the names of the files. You could probably iterate over that array and lazily create whichever File object you need.
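A sketch of that idea (the class name is illustrative): the String[] of names is still built eagerly, but the heavier File objects are created one at a time, only when needed.

```java
import java.io.File;

public class NameIterator {
    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        // list() returns only the names as Strings; constructing a File
        // from a name is cheap and does no I/O until you stat it.
        String[] names = dir.list();
        if (names != null) { // null if dir is not a readable directory
            for (String name : names) {
                File f = new File(dir, name); // created lazily, per entry
                System.out.println(f.getName());
            }
        }
    }
}
```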

Kshitij