I need to traverse a directory hierarchy containing about 20 million files in Java. Currently I'm using FileUtils.iterateFiles
from Apache Commons IO. This appears to build the complete file list in memory up front, which is slow (it delays application startup considerably) and a huge memory hog (around 8 GB). I was previously using my own recursive file iterator, which had the same problem.
I only need to process one file at a time (or, down the track, a handful from the front of the list in parallel), so it seems a little unnecessary to waste all this time and memory loading a complete list into memory.
Java's Iterator interface allows for exactly the kind of minimal-memory-footprint iteration I need, but since java.io.File only exposes directory contents as eagerly-populated arrays (via list() and listFiles()), it seems bizarrely difficult to take advantage of it.
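To be concrete, what I'm after is roughly the following (just a sketch using nothing beyond Java 6; the class name and structure are mine): a stack of per-directory iterators, so that each listFiles() call materializes only one directory's entries at a time, rather than the whole tree.

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Walks a directory tree depth-first, lazily. Each listFiles() call
 * still loads one directory's entries, but memory is bounded by
 * (tree depth x largest single directory), not total file count.
 */
public class LazyFileIterator implements Iterator<File> {
    private final Deque<Iterator<File>> stack = new ArrayDeque<Iterator<File>>();
    private File next;

    public LazyFileIterator(File root) {
        stack.push(listChildren(root));
        advance();
    }

    private Iterator<File> listChildren(File dir) {
        File[] children = dir.listFiles();
        // listFiles() returns null on I/O error or permission denial
        return Arrays.asList(children == null ? new File[0] : children).iterator();
    }

    private void advance() {
        next = null;
        while (!stack.isEmpty()) {
            Iterator<File> it = stack.peek();
            if (!it.hasNext()) {
                stack.pop();          // finished this directory, back up
                continue;
            }
            File candidate = it.next();
            if (candidate.isDirectory()) {
                stack.push(listChildren(candidate)); // descend lazily
            } else {
                next = candidate;     // found the next plain file
                return;
            }
        }
    }

    public boolean hasNext() {
        return next != null;
    }

    public File next() {
        if (next == null) throw new NoSuchElementException();
        File result = next;
        advance();
        return result;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}
```

Memory use here should stay proportional to the depth of the tree and the size of the largest single directory, which for my hierarchy would be tiny compared to 20 million entries.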
Does anyone have any suggestions for how I can traverse the file hierarchy without loading it all into memory in advance?
Thanks to this answer I'm now aware of the new Java 7 file API, which I think would solve my problem, but Java 7 is not really an option for me at this stage.
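For reference, in case it helps anyone who can use Java 7, I believe the equivalent there would be something along these lines (untested on my end, since I'm stuck on Java 6; the path and the processFile handler are placeholders):

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class WalkExample {
    public static void main(String[] args) throws IOException {
        // walkFileTree streams entries one directory at a time
        // instead of building the full list up front
        Files.walkFileTree(Paths.get("/data"), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                processFile(file); // placeholder per-file handler
                return FileVisitResult.CONTINUE;
            }
        });
    }

    private static void processFile(Path file) {
        System.out.println(file);
    }
}
```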