37

When reading from InputStreams, how do you decide what size to use for the byte[]?

int nRead;
byte[] data = new byte[16384]; // <-- this number is the one I'm wondering about

while ((nRead = is.read(data, 0, data.length)) != -1) {
  // ...do something with the first nRead bytes of data...
}

When do you use a small one vs a large one? What are the differences? Does the number need to be a multiple of 1024? Does it make a difference whether the InputStream comes from the network or from disk?

Thanks much, I can't seem to find a clear answer elsewhere.

skaffman
cottonBallPaws
  • 3
    I'm wondering the same thing for C#. I suppose it's the same answer. The memory footprint can probably be taken into account (the smaller the chunk, the smaller the memory footprint). Another factor is the kind of input stream: a network stream will take longer to fill the buffer than a memory stream. You'll get less control with a large buffer. – Steve B Jan 05 '12 at 20:05
  • 1
    A larger buffer should speed up reading from a fast source (fewer iterations) but, on the other hand, is a waste of space for slow sources (speed is dominated by waiting, so it doesn't matter how fast your loop is) – akappa Jan 05 '12 at 20:08

5 Answers

27

Most people use powers of 2 for the size. If the buffer is at least 512 bytes, the exact size doesn't make much difference (< 20%).

For network reads, the optimal size can be 2 KB to 8 KB (the underlying packet size is typically up to ~1.5 KB). For disk access, the fastest size can be 8 KB to 64 KB. If you use 8 KB or 16 KB, you won't have a problem.

Note that for network downloads, you will usually find that a read doesn't fill the whole buffer. Wasting a few KB doesn't matter much for 99% of use cases.
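As a minimal sketch of the advice above, the loop below copies a stream through an 8 KB buffer. The class name `StreamCopy`, the method `copy`, and the constant `DEFAULT_BUFFER_SIZE` are hypothetical names chosen for illustration, and the in-memory streams merely stand in for a real file or socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamCopy {
    // 8 KB: a power of 2 inside the range suggested above. An assumption, not a tuned value.
    static final int DEFAULT_BUFFER_SIZE = 8 * 1024;

    /** Copies everything from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int nRead;
        // read() may return fewer bytes than the buffer holds; only the first nRead are valid.
        while ((nRead = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, nRead);
            total += nRead;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] source = new byte[100_000]; // stand-in for a file or socket
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(source), sink, DEFAULT_BUFFER_SIZE);
        System.out.println("copied " + copied + " bytes");
    }
}
```

Because only `nRead` bytes are written on each pass, a partially filled buffer (the common case on a network stream) costs nothing beyond the unused memory.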

Peter Lawrey
  • Great info about the difference between network/disk! I'd guess the protocol used can make a lot of difference (CIFS, NFS, ...). Is it something you noticed (e.g. with Java Chronicles ;)) wrt network? I was about to ask such a question... – Matthieu May 04 '20 at 01:23
  • 1
    @Matthieu networks are usually configured with an MTU of about 1500 bytes, so if you read the stream fast enough you will rarely see a 2 KB buffer fill. Disk subsystems, however, tend to have native block sizes around 64 KB, so larger blocks are consistently filled for files much larger than this. – Peter Lawrey May 06 '20 at 17:32
4

In that situation, I always use a reasonable power of 2, somewhere in the range of 2K to 16K. In general, different InputStreams will have different optimal values, but there is no easy way to determine the value.

To determine the optimal value, you'd need to understand more about the exact type of InputStream you are dealing with, as well as things like the specifications of the hardware that is servicing the InputStream.
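If you do want a rough number for one particular source, one option is simply to time a few candidate sizes. The sketch below assumes a local temporary file as the source; `BufferSizeTiming` and `readAll` are hypothetical names, and the timings are only indicative because the OS file cache and JIT warm-up will distort them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class BufferSizeTiming {
    /** Reads the whole file with the given buffer size and returns the byte count. */
    static long readAll(Path file, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int nRead;
            while ((nRead = in.read(buffer)) != -1) {
                total += nRead;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical test data: a 16 MB temporary file of zeros.
        Path file = Files.createTempFile("buffer-size-test", ".bin");
        try {
            Files.write(file, new byte[16 * 1024 * 1024]);
            for (int size : new int[] {512, 2048, 8192, 16384, 65536}) {
                long start = System.nanoTime();
                long total = readAll(file, size);
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println(size + "-byte buffer: " + total + " bytes in " + micros + " us");
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

For anything beyond a sanity check, a proper benchmarking harness and the real data source would be needed.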

Worrying about this is probably a case of premature optimization.

JohnnyO
3

It mostly depends on how much memory you have and how much data you expect to read. You don't want to block too often, so consider BenCole's answer; on the other hand, you don't want to process a small chunk of data if your processing is slower than the actual reading.

I personally try to use a library and offload the task of choosing a buffer size to the library authors. After that, I promise myself never to read the library code, because it makes me mad.

alf
0

I'd also say that, if reading from an InputStream (not from a ReadableByteChannel like a FileChannel or a SocketChannel), you should not care, as long as you're wrapping it in a BufferedInputStream with a "correct" buffer size: the internal buffer will take care of the reads for you so you can focus on just reading the pieces you need.

In that case, the buffer size is probably what you're looking for and I would redirect you to @Peter Lawrey's answer: 2-8 KB when the data comes from the network, or 32-64 KB when it comes from a hard drive (a "chunk" of disk).
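A minimal sketch of that wrapping, assuming an 8 KB internal buffer and an in-memory source in place of a real socket or file; `BufferedReadDemo` and `sumFirstBytes` are hypothetical names:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class BufferedReadDemo {
    /** Sums up to count bytes (as unsigned values) using single-byte reads. */
    static int sumFirstBytes(InputStream raw, int count, int bufferSize) throws IOException {
        // The wrapper fills its internal buffer in chunks of up to bufferSize bytes,
        // so the small read() calls below mostly hit memory, not the underlying stream.
        try (InputStream in = new BufferedInputStream(raw, bufferSize)) {
            int sum = 0;
            for (int i = 0; i < count; i++) {
                int b = in.read();
                if (b == -1) {
                    break; // end of stream reached early
                }
                sum += b;
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4, 5};
        System.out.println(sumFirstBytes(new ByteArrayInputStream(data), 3, 8 * 1024)); // prints 6
    }
}
```

The second constructor argument of `BufferedInputStream` is the size of its internal buffer; the caller can then read pieces of any size.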

When reading from a ByteChannel though, you'll have to do the buffering yourself through a ByteBuffer that you can allocate with that value.
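Something along these lines, as a sketch that assumes a blocking channel (a non-blocking SocketChannel can return 0 from read() and would need a selector instead of this loop). `ChannelReadDemo` and `drain` are hypothetical names, and the in-memory channel stands in for a FileChannel or SocketChannel:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

class ChannelReadDemo {
    /** Drains the channel through a buffer of the given capacity; returns bytes read. */
    static long drain(ReadableByteChannel channel, int bufferSize) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(bufferSize);
        long total = 0;
        while (channel.read(buffer) != -1) {
            buffer.flip();               // switch from filling to draining
            total += buffer.remaining(); // a real consumer would process the bytes here
            buffer.clear();              // make the whole buffer writable again
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try (ReadableByteChannel ch =
                 Channels.newChannel(new ByteArrayInputStream(new byte[50_000]))) {
            System.out.println("drained " + drain(ch, 32 * 1024) + " bytes");
        }
    }
}
```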

Matthieu
0

By using the available() method in the InputStream class. From the Javadoc:

Returns the number of bytes that can be read (or skipped over) from this input stream without blocking by the next caller of a method for this input stream. The next caller might be the same thread or another thread.
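A small sketch of what available() reports in practice. For an in-memory ByteArrayInputStream the count is exact, but the base InputStream implementation returns 0, so a read loop should not size its buffer from it. `AvailableDemo`, `readFully`, and `threeByteStream` are hypothetical names used only for this illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class AvailableDemo {
    /** Reads to end of stream with a fixed buffer, never consulting available(). */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8 * 1024];
        int nRead;
        while ((nRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, nRead);
        }
        return out.toByteArray();
    }

    /** A stream that yields three bytes but inherits the default available(). */
    static InputStream threeByteStream() {
        return new InputStream() {
            private int remaining = 3;

            @Override
            public int read() {
                return remaining-- > 0 ? 7 : -1;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        InputStream inMemory = new ByteArrayInputStream(new byte[20_000]);
        // Exact here only because all the data is already in memory.
        System.out.println("in-memory available() = " + inMemory.available());

        InputStream custom = threeByteStream();
        // The inherited default reports 0 even though data can still be read.
        System.out.println("custom available() = " + custom.available());
        System.out.println("custom stream held " + readFully(custom).length + " bytes");
    }
}
```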

BenCole
  • 1
    The only sane use of the `available()` method is to determine whether the call might block. You should only care whether it's zero or nonzero. You should also code with the understanding that some implementations may return zero every single time, in which case your code needs to notice this and then disregard it going forward. `available()` is not guaranteed to return the total size of the data, the amount that will be filled by the next `read()`, or really anything in particular. – j__m Jan 10 '20 at 19:40
  • Current javadoc link does not agree with the quote given in the answer: https://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#available() – j__m Jan 10 '20 at 19:49