4

I need to traverse a directory hierarchy containing about 20 million files in Java. Currently I'm using FileUtils.iterateFiles from Apache Commons-IO. This seems to work by loading the whole list into memory, which is slow (delaying the application startup time) and a huge memory hog (around 8GB). I was previously using my own recursive file iterator which had the same problem.

I only need to process one file at a time (or, down the track, a handful from the front of the list in parallel), so it seems a little unnecessary to waste all this time and memory loading a complete list into memory.

Java's Iterator class allows for minimal-memory footprint iterators of the kind that I need, but since the native features of the java.io.File class only provide eagerly-initialized arrays, it seems to be bizarrely difficult to take advantage of these.

Does anyone have any suggestions for how I can traverse the file hierarchy without loading it all into memory in advance?

Thanks to this answer I'm now aware of the new Java 7 file API, which I think would solve my problem, but Java 7 is not really an option for me at this stage.

Community
  • 1
  • 1
Andy MacKinlay
  • 1,374
  • 1
  • 16
  • 23
  • What for you need to traverse all the files if you need to process only one at a time? Please, tell us more about processing – Nikolay Kuznetsov Dec 10 '12 at 05:30
  • I only need to process one at a time, but I still need to process all of them eventually. I just don't have the memory to waste on storing 20 million `File` objects in memory when I'm only processing a handful at a time. Does that make sense? – Andy MacKinlay Dec 10 '12 at 05:33
  • you don't want to load the list of all files or the contents of all files to memory? – Nikolay Kuznetsov Dec 10 '12 at 05:34
  • I don't **want** to load the list of all files, but I can't see a way to avoid it - that is what I'm asking about (I'm definitely not reading the **contents** of all files into memory simultaneously now or in the future, but that is easy to avoid). – Andy MacKinlay Dec 10 '12 at 05:38
  • Why not implementing your own Iterator? It is so easy, and can handle memory issue. – Amir Pashazadeh Dec 10 '12 at 06:53
  • Good point Amir - see below. – Andy MacKinlay Dec 10 '12 at 23:23

3 Answers3

1

Since Java 7 NIO is not an option, you could execute "dir /B /A-D" (for Windows) and read file names from the output. If need be you could redirect the output to a temp file and read file names from there.

Evgeniy Dorofeev
  • 133,369
  • 30
  • 199
  • 275
1

I know it's not strictly the answer to your question, but can you not reorganize the directory tree to use more levels of directories, so that each directory contains fewer files?

Dolda2000
  • 25,216
  • 4
  • 51
  • 92
  • This *could* work indeed, provided that you have a way to evenly (?) distribute files into separate directories, similar to a hash function. – Lefteris Laskaridis Dec 10 '12 at 07:26
  • I already have the files distributed among directories. Each directory has a maximum of 30000 files in them. Changing the nesting level actually doesn't help, since the `commons-io` file iterator (and my own half-baked solution) still load every file from the hierarchy into memory when traversing (unless I changed my approach to explicitly iterate to a particular depth in the hierarchy, but I wanted something more generic than that) – Andy MacKinlay Dec 10 '12 at 23:10
  • Why not just process one directory at a time, then? Storing some mere 30k filenames in memory shouldn't be a problem, should it? – Dolda2000 Dec 11 '12 at 03:10
  • No it indeed isn't, but I don't want my code to need to depend on there being a particular level of structure in the file hierarchy. The solution I've posted below allows for arbitrary levels of nesting with no changes to the calling code required. The maximal memory usage is affected only by the number of files in a single directory (which is intrinsic to Java) so it still needs to read 30k in (although if I had 20 million files in a single directory I'd still be in trouble) – Andy MacKinlay Dec 13 '12 at 00:18
1

OK, I've ended up implementing my own iterator to do this (as Amir suggested). It wasn't exactly trivial (although fortunately someone already wrote code to flatten iterators), but is reasonably straightforward

It still holds a complete listing of a single directory (without descendants) in memory, so it's no use for a flat directory layout (in that case I think you're out of luck using pure Java until Java 7) but so far it's working much better for my use case.

RecursiveFileIterable.java:

import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class RecursiveFileIterable implements Iterable<File> {
    private File file;

    public RecursiveFileIterable(File f) {
        file = f;
    }

    public RecursiveFileIterable(String filename) {
        this(new File(filename));
    }

    private class DirectoriesOnlyFilter implements FileFilter {
        @Override
        public boolean accept(File pathname) {
            return pathname.isDirectory();
        }

    }

    private class NoDirectoriesFilter implements FileFilter {
        @Override
        public boolean accept(File pathname) {
            return !pathname.isDirectory();
        }
    }

    @Override
    public Iterator<File> iterator() {
        List<File> normFiles = Arrays.asList(file
                .listFiles(new NoDirectoriesFilter()));
        ArrayList<Iterable<File>> pendingIterables = new ArrayList<Iterable<File>>();
        pendingIterables.add(normFiles);

        File[] subdirs = file.listFiles(new DirectoriesOnlyFilter());
        for (File sd : subdirs)
            pendingIterables.add(new RecursiveFileIterable(sd));

        return new FlattenIterable<File>(pendingIterables).iterator();

    }

}

FlattenIterable.java:

// from http://langexplr.blogspot.com.au/2007/12/combining-iterators-in-java.html

import java.util.Iterator;

public class FlattenIterable<T> implements Iterable<T> {
    private Iterable<Iterable<T>> iterable;

    public FlattenIterable(Iterable<Iterable<T>> iterable) {
        this.iterable = iterable;
    }

    public Iterator<T> iterator() {
        return new FlattenIterator<T>(iterable.iterator());
    }

    static class FlattenIterator<T> implements Iterator<T> {
        private Iterator<Iterable<T>> iterator;
        private Iterator<T> currentIterator;

        public FlattenIterator(Iterator<Iterable<T>> iterator) {
            this.iterator = iterator;
            currentIterator = null;
        }

        public boolean hasNext() {
            boolean hasNext = true;
            if (currentIterator == null) {
                if (iterator.hasNext()) {
                    currentIterator = iterator.next().iterator();
                } else {
                    return false;
                }
            }

            while (!currentIterator.hasNext() && iterator.hasNext()) {
                currentIterator = iterator.next().iterator();
            }

            return currentIterator.hasNext();
        }

        public T next() {
            return currentIterator.next();
        }

        public void remove() {
        }
    }
}
Andy MacKinlay
  • 1,374
  • 1
  • 16
  • 23
  • I could also have used [`IteratorChain`](http://commons.apache.org/collections/api-release/org/apache/commons/collections/iterators/IteratorChain.html) from Apache Commons instead of `FlattenIterable` – Andy MacKinlay Dec 10 '12 at 23:42