Below are the heuristics that I use to aid my decisions when designing for fast file IO and a set of benchmarks that I use to test different alternatives.
Heuristics:
- preallocate the file; asking the OS to resize a file is expensive,
- stream the data as much as possible, avoiding seeks, which perform poorly on spinning disks,
- batch the writes (while taking care not to create excessive GC pressure),
- when designing for SSDs, avoid updating data in place; that is the slowest operation on an SSD. A complete guide to these SSD quirks can be read here,
- where possible, avoid copying data between buffers (this is where Java NIO can help), and
- if possible, use memory-mapped files. Memory-mapped files are underused in Java; handing the disk writes over to the OS
to perform asynchronously is typically an order of magnitude faster
than the alternatives, i.e. BufferedWriter and RandomAccessFile.
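The last two heuristics combine naturally: map the file once, preallocated to its final size, and write through the mapped buffer. A minimal sketch (the file name and size here are arbitrary placeholders):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedWriteExample {
    public static void main(String[] args) throws IOException {
        long size = 1024 * 1024;  // preallocate 1 MB up front
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw");
             FileChannel channel = raf.getChannel()) {
            raf.setLength(size);  // ask the OS for the space once, not on every write
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            for (long i = 0; i < size / 8; i++) {
                buffer.putLong(i);  // writes land in the page cache; the OS flushes asynchronously
            }
            buffer.force();  // optional: block until the OS has written the pages to disk
        }
    }
}
```

Note that `force()` is only needed when you want a durability guarantee at that point; leaving it out lets the OS schedule the flush, which is where the speed comes from.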
I wrote the following file benchmarks a while ago. Give them a run: https://gist.github.com/kirkch/3402882
When I ran the benchmarks against a standard spinning disk, I got these results:
Stream Write: 438
Mapped Write: 28
Stream Read: 421
Mapped Read: 12
Stream Read/Write: 1866
Mapped Read/Write: 19
All numbers are in ms, so smaller is better. Notice that memory-mapped files consistently outperform every other approach.
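The mapped-read side mirrors the write side: map the region read-only and pull values straight out of the page cache. A minimal sketch, assuming a file that can be mapped whole (the name and size are placeholders; summing the longs just gives the loop something to do):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedReadExample {
    // Maps the file and sums every long in it; extends the file to `size` if shorter.
    static long sumLongs(String path, long size) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
             FileChannel channel = raf.getChannel()) {
            raf.setLength(size);
            MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long sum = 0;
            while (buf.remaining() >= 8) {
                sum += buf.getLong();  // reads come straight from the page cache
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("sum=" + sumLongs("read.bin", 1024 * 1024));
    }
}
```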
The other surprise that I have found when writing these types of systems is that in later versions of Java, using BufferedWriter can be slower than just using FileWriter directly or RandomAccessFile. It turns out that buffering is now done lower down; I believe that happened when Sun rewrote java.io to use channels and byte buffers under the covers. Yet the advice of adding one's own buffering remains common practice. As always, measure first on your target environment; feel free to adjust the benchmark code above to experiment further.
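A quick way to check this claim on your own machine is to time the same write workload through both writers. This is only a crude sketch, not a rigorous benchmark (no warmup, arbitrary file names and line count), so treat the numbers as a rough signal:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class WriterComparison {
    // Writes `lines` short lines through the given writer and returns elapsed ms.
    static long timeWrite(Writer w, int lines) throws IOException {
        long start = System.nanoTime();
        try (Writer out = w) {
            for (int i = 0; i < lines; i++) {
                out.write("line " + i + "\n");
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        int lines = 1_000_000;
        long plain    = timeWrite(new FileWriter("plain.txt"), lines);
        long buffered = timeWrite(new BufferedWriter(new FileWriter("buffered.txt")), lines);
        System.out.println("FileWriter: " + plain + " ms, BufferedWriter: " + buffered + " ms");
    }
}
```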
While looking for links to back up some of the facts above, I came across Martin Thompson's post on this topic. It is well worth a read.