
I am using ByteBuffer.allocateDirect() to allocate some buffer memory for reading a file into memory and then eventually hashing that file's bytes to get a file hash (SHA) out of it. The input files range greatly in size, anywhere from a few KB to several GB.

I have read several threads and pages (even some on SO) regarding selecting a buffer size. Some advised trying to match whatever block size the native filesystem uses, to minimize the chance of a read operation fetching only a partial block, etc. For example, with a buffer of 4100 bytes and NTFS defaulting to 4096-byte clusters, the extra 4 bytes would require a separate read operation, which is extremely wasteful.

So, sticking with powers of 2: 1024, 2048, 4096, 8192, etc. I have seen some recommend 32 KB buffers, and others recommend making the buffer the size of the input file (probably fine for small files, but what about large files?).

How important is it to stick to native block-sized buffers? Speaking in modern terms (assuming a modern SATA drive or better with at least 8 MB of on-drive cache, and other modern OS "magic" to optimize I/O), how critical is the buffer size, and how should I best determine what to set mine to? Should I set it statically, or determine it dynamically? Thanks for any insight.
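
For reference, the read-and-hash loop I'm describing looks roughly like the sketch below (simplified; SHA-256 and the 8192-byte buffer are just placeholders, and that buffer size is exactly what I'm asking about):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DirectBufferHash {

    // The value in question -- 8192 is just a placeholder.
    private static final int BUF_SIZE = 8192;

    static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        ByteBuffer buf = ByteBuffer.allocateDirect(BUF_SIZE);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (ch.read(buf) != -1) {
                buf.flip();     // switch from filling the buffer to draining it
                md.update(buf); // digest whatever the channel delivered this pass
                buf.clear();    // ready for the next read
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("%x%n", new java.math.BigInteger(1, sha256(Paths.get(args[0]))));
    }
}
```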

SnakeDoc
  • The only way you will know is to test. Note that what is best for one computer may not apply to another. Make or download some files of various sizes and read them in. See how long it takes. – Lee Meador Apr 17 '13 at 18:16
  • Possible duplicate of http://stackoverflow.com/questions/236861/how-do-you-determine-the-ideal-buffer-size-when-using-fileinputstream – Ben Barkay Apr 17 '13 at 18:23
  • @LeeMeador the problem with testing is that the input file size is completely unknown. – SnakeDoc Apr 17 '13 at 18:29
  • Just use representative files of various sizes in the range you expect and average the bytes per second across the multiple files. For example, 3 - 300 MB might be an interesting range. Or 100K to 10 Gb. Your guess will be good enough for estimating. – Lee Meador Apr 17 '13 at 18:30

1 Answer


To answer your direct question: (1) filesystems tend to use powers of 2, so you want to do the same. (2) the larger your working buffer, the less effect any mis-sizing will have.

As you say, if you allocate 4100 and the actual block size is 4096, you'll need two reads to fill the buffer. If, instead, you have a 1,000,000 byte buffer, then being one block high or low doesn't matter (because it takes 245 4096-byte blocks to fill that buffer). Moreover, the larger buffer means that the OS has a better chance to order the reads.

That said, I wouldn't use NIO for this. Instead, I'd use a simple BufferedInputStream, with maybe a 1k buffer for my read()s.

The main benefit of NIO is keeping data out of the Java heap. If you're reading and writing a file, for example, using an InputStream means that the OS reads the data into a JVM-managed buffer, the JVM copies that into an on-heap buffer, then copies it again to an off-heap buffer, and then the OS reads that off-heap buffer to write the actual disk blocks (and typically adds its own buffers). In this case, NIO will eliminate those native-heap copies.
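
To make that concrete: if you were simply copying a file, a channel-to-channel transfer keeps the bytes out of the Java heap entirely (a rough sketch; paths and error handling are up to you):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelCopy {

    // Copy src to dst without pulling the bytes into the Java heap;
    // transferTo lets the OS move the data directly (often zero-copy).
    static void copy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = 0, size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out); // may transfer fewer bytes than requested
            }
        }
    }
}
```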

However, to compute a hash you need to have the data in the Java heap, and the MessageDigest SPI will move it there. So you don't get the benefit of NIO keeping the data off-heap, and IMO the "old IO" is easier to write.

Just don't forget that InputStream.read() is not guaranteed to read all the bytes you ask for.
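
Something along these lines (a minimal sketch; SHA-256 and the 1 KB array are arbitrary choices, and the key point is using the count that read() actually returns):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StreamHash {

    static byte[] sha256(String file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[1024]; // the "1k buffer" mentioned above; tune by measurement
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(file))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n); // use the count actually read, never buf.length
            }
        }
        return md.digest();
    }
}
```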

parsifal
  • hmm... i just spent a while refactoring my hashing method/algo to achieve increased performance. See this thread: http://stackoverflow.com/questions/16050827/filechannel-bytebuffer-and-hashing-files -- scroll down to my most recent update to see my current method -- the first method in the OP was my original method that worked very well, but I set out to optimize/micro-optimize it since the hashing is very integral in a project I'm working on. – SnakeDoc Apr 17 '13 at 19:44
  • 1
    @SnakeDoc - Looking at your other question, I see that you increased the size of the buffer to 8K. I wonder how the performance would change in the first (InputStream) version if you used a similar-size buffer? – parsifal Apr 17 '13 at 20:25
  • 1
    But, given that your ByteBuffer code looks reasonable, I'd turn to the idea of a much larger buffer to allow the OS to do a large sequential read. I don't know that you'll get a huge performance boost from it, but as some commenters have said, you'll only know by repeated testing. – parsifal Apr 17 '13 at 20:26
  • thanks for the input -- I honestly never went back and tried different buffer sizes in my original InputStream algo... as I had been advised that `java.nio.*` was the way to go for any performance-oriented i/o ops I needed to do... even if it's just for reading... hmm... I will need to test this, and try to post back for completeness. My hardest obstacle is how to properly "test" these things without the underlying "magic" the OS and disk drives do to optimize I/O skewing my results... – SnakeDoc Apr 17 '13 at 21:04
  • 1
    @SnakeDoc - I'm not sure that you *should* try to isolate the OS and disk drives, because they'll affect your production timing. If you can, try to replicate your production environment in test (IMO, getting the OS and memory size right will be the biggest factors, drive type less so unless you're comparing SAN and SSD). For testing, try running the same program multiple times, and also try interleaving your programs. – parsifal Apr 18 '13 at 13:03
  • 1
    Finally, decide whether the buffer cache matters. If you're constantly processing new files in production, then the cache will make repeated test runs faster than they actually are. Here is a page that lists some ways to clear the cache (I can't remember if you use Linux or Windows, but I use Linux so that's what I look for): http://www.commandlinefu.com/commands/view/1026/empty-the-linux-buffer-cache – parsifal Apr 18 '13 at 13:05
  • It's Linux FTW baby ya! ;-p In seriousness though, thanks parsifal for the advice. What you say makes a lot of sense. My expectation is there will be almost no OS/disk magic for most runs of the program, since it's designed as a nightly job and the fileset will have changed dramatically throughout the day. I'm going to assemble a test rig to try multiple buffer sizes over both algorithms and on multiple file sizes and see if I can get some half-way decent benchmarks. I'll post my test-rig code here once it's done so others can easily run their own benchmark. Thanks again! – SnakeDoc Apr 18 '13 at 14:38
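
A bare-bones sketch of the kind of test rig discussed above (file names and buffer sizes are hypothetical placeholders; remember that the OS buffer cache will skew repeated runs over the same files, as noted in the comments):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.security.MessageDigest;

public class BufferSizeBench {

    public static void main(String[] args) throws Exception {
        // Hypothetical test files -- replace with representative sizes (KB to GB).
        String[] files = { "small.bin", "medium.bin", "large.bin" };
        int[] bufSizes = { 1024, 4096, 8192, 32 * 1024, 1024 * 1024 };

        for (int size : bufSizes) {
            long start = System.nanoTime();
            for (String f : files) {
                hash(f, size);
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(size + "-byte buffer: " + elapsedMs + " ms");
        }
    }

    // Stream-based hashing, parameterized on the buffer size being tested.
    static byte[] hash(String file, int bufSize) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[bufSize];
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(file), bufSize)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }
}
```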