33

Can anyone recommend whether I should do something like:

os = new GZIPOutputStream(new BufferedOutputStream(...));

or

os = new BufferedOutputStream(new GZIPOutputStream(...));

Which is more efficient? Should I use BufferedOutputStream at all?

Audrius Meškauskas
sanity

6 Answers

33

GZIPOutputStream already comes with a built-in buffer. So, there is no need to put a BufferedOutputStream right next to it in the chain. gojomo's excellent answer already provides some guidance on where to place the buffer.

The default buffer size for GZIPOutputStream is only 512 bytes, so you will want to increase it to 8K or even 64K via the constructor parameter. The default buffer size for BufferedOutputStream is 8K, which is why you can measure an advantage when combining the default GZIPOutputStream and BufferedOutputStream. That advantage can also be achieved by properly sizing the GZIPOutputStream's built-in buffer.

So, to answer your question: "Should I use BufferedOutputStream at all?" → No, in your case, you should not use it, but instead set the GZIPOutputStream's buffer to at least 8K.
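For illustration, here is a minimal sketch of that, assuming the destination is a plain file (the file name and payload are placeholders):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class SizedGzipBuffer {
    public static void main(String[] args) throws IOException {
        // The second constructor argument sets GZIPOutputStream's internal
        // buffer size (the default is only 512 bytes); 64K is used here as an example.
        try (OutputStream os = new GZIPOutputStream(new FileOutputStream("data.gz"), 64 * 1024)) {
            os.write("some payload".getBytes());
        }
    }
}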

barfuin
    That's only half correct: The GZIPOutputStream internal output buffer determines how many bytes can be compressed in one go, but if you feed it small chunks of bytes (or individual bytes), it will still write to the backing output stream as soon as the Deflater can emit a block of compressed data, even if the output buffer would fit much more data. But that's fine if you put a second buffer in front of the GZIPOutputStream, ensuring only large chunks are written. So the answer is still correct, only the details are a bit more complicated than that. – defnull Feb 11 '19 at 09:53
31

What order should I use GzipOutputStream and BufferedOutputStream

For object streams, I found that wrapping the buffered stream around the gzip stream, for both input and output, was almost always significantly faster. The smaller the objects, the more this helped. In all cases it was better than, or the same as, using no buffered stream.

ois = new ObjectInputStream(new BufferedInputStream(new GZIPInputStream(fis)));
oos = new ObjectOutputStream(new BufferedOutputStream(new GZIPOutputStream(fos)));

However, for text and straight byte streams, I found that it was a toss-up, with the gzip stream around the buffered stream being only slightly better. But it was better than no buffered stream in all cases.

reader = new InputStreamReader(new GZIPInputStream(new BufferedInputStream(fis)));
writer = new OutputStreamWriter(new GZIPOutputStream(new BufferedOutputStream(fos)));

I ran each version 20 times and cut off the first run and averaged the rest. I also tried buffered-gzip-buffered which was slightly better for objects and worse for text. I did not play with buffer sizes at all.
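A sketch of that measurement loop (the harness is hypothetical; the stream chains from the snippets above would be plugged into the task):

// Run a task 20 times, drop the first (warm-up) run, and average the rest.
public class TimingHarness {
    interface Task {
        void run() throws Exception;
    }

    static double averageMillis(Task task, int runs) throws Exception {
        double totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            if (i > 0) {              // cut off the first run
                totalNanos += elapsed;
            }
        }
        return totalNanos / (runs - 1) / 1_000_000.0;
    }

    public static void main(String[] args) throws Exception {
        // e.g. read or write the test file with either wrapping order here
        double ms = averageMillis(() -> { /* stream read/write goes here */ }, 20);
        System.out.printf("average: %.1f ms%n", ms);
    }
}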


For the object streams, I tested 2 serialized object files in the 10s of megabytes. For the larger file (38mb), it was 85% faster on reading (0.7 versus 5.6 seconds) but actually slightly slower for writing (5.9 versus 5.7 seconds). These objects had some large arrays in them which may have meant larger writes.

method       crc     date  time    compressed    uncompressed  ratio
defla   eb338650   May 19 16:59      14027543        38366001  63.4%

For the smaller file (18mb), it was 75% faster for reading (1.6 versus 6.1 seconds) and 40% faster for writing (2.8 versus 4.7 seconds). It contained a large number of small objects.

method       crc     date  time    compressed    uncompressed  ratio
defla   92c9d529   May 19 16:56       6676006        17890857  62.7%

For the text reader/writer I used a 64mb csv text file. The gzip stream around the buffered stream was 11% faster for reading (950 versus 1070 milliseconds) and slightly faster when writing (7.9 versus 8.1 seconds).

method       crc     date  time    compressed    uncompressed  ratio
defla   c6b72e34   May 20 09:16      22560860        63465800  64.5%
Gray
  • It would have been nice to see the exact input data as well as standard deviation, otherwise the results aren't reproducible – darw Aug 12 '20 at 14:25
13

The buffering helps when the ultimate destination of the data is best read/written in larger chunks than your code would otherwise push it. So you generally want the buffering as close as possible to the place that wants larger chunks. In your examples, that's the elided "...", so wrap the BufferedOutputStream with the GzipOutputStream. And tune the BufferedOutputStream's buffer size to whatever testing shows works best for the destination.

I doubt a BufferedOutputStream on the outside would help much, if at all, over no explicit buffering. Why not? The GzipOutputStream will do its write()s to "..." in the same-sized chunks whether the outside buffering is present or not. So there's no optimizing for "..." possible; you're stuck with whatever sizes GzipOutputStream chooses for its write()s.

Note also that you're using memory more efficiently by buffering the compressed data rather than the uncompressed data. If your data often achieves 6X compression, the 'inside' buffer is equivalent to an 'outside' buffer 6X as big.
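As a minimal sketch of that arrangement, assuming the elided "..." is a FileOutputStream (the file name and the 8K size are placeholders; tune the size to the destination):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class BufferNearDestination {
    public static void main(String[] args) throws IOException {
        // The buffer sits next to the file, so the compressed bytes are
        // collected into larger chunks before they reach the OS/disk.
        try (OutputStream os = new GZIPOutputStream(
                new BufferedOutputStream(new FileOutputStream("out.gz"), 8 * 1024))) {
            os.write("uncompressed payload".getBytes());
        }
    }
}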

gojomo
5

Normally you want a buffer close to your FileOutputStream (assuming that's what ... represents) to avoid too many calls into the OS and frequent disk access. However, if you're writing a lot of small chunks to the GZIPOutputStream, you might benefit from a buffer around the GZIPOS as well. The reason is that the write method in GZIPOS is synchronized, and it also leads to a few other synchronized calls and a couple of native (JNI) calls (to update the CRC32 and do the actual compression). These all add extra overhead per call. So in that case I'd say you'd benefit from both buffers.
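A sketch of that double-buffered chain, again assuming the "..." is a FileOutputStream (file name and buffer sizes are illustrative):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class DoubleBuffered {
    public static void main(String[] args) throws IOException {
        // The outer buffer coalesces many small application writes before each
        // (synchronized, JNI-backed) GZIPOutputStream.write() call; the inner
        // buffer coalesces the compressed output before it reaches the file.
        try (OutputStream os = new BufferedOutputStream(
                new GZIPOutputStream(
                        new BufferedOutputStream(new FileOutputStream("both.gz"), 8 * 1024)),
                8 * 1024)) {
            for (int i = 0; i < 100_000; i++) {
                os.write(i & 0xFF); // many tiny writes
            }
        }
    }
}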

roozbeh
1

I suggest you try a simple benchmark to time how long it takes to compress a large file, and see if it makes much difference. GzipOutputStream does have buffering, but it is a smaller buffer. I would try the first option with a 64K buffer, but you might find that doing both is better.
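For example, a rough single-pass timing of the first option with a 64K buffer might look like the sketch below (the input file name is a placeholder, and a real benchmark should warm up and repeat runs, as in the answer above that ran each version 20 times):

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipTiming {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        // First option from the question, with a 64K buffer next to the file.
        try (InputStream in = new FileInputStream("large-input.bin");
             OutputStream out = new GZIPOutputStream(
                     new BufferedOutputStream(new FileOutputStream("large-input.bin.gz"), 64 * 1024))) {
            byte[] chunk = new byte[8 * 1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        System.out.printf("compressed in %.1f ms%n", (System.nanoTime() - start) / 1_000_000.0);
    }
}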

Peter Lawrey
-1

Read the javadoc, and you will discover that a BufferedInputStream (BIS) is used to buffer bytes read from some original source. Once you get the raw bytes you want to compress them so you wrap BIS with a GIS (GZIPInputStream). It makes no sense to buffer the output from the GZIP stream; if you think about buffering the GZIP output, you have to ask who is going to do that, and why.

new GZIPInputStream(new BufferedInputStream(new FileInputStream(...)))
mP.
    "you want to compress them so you wrap BIS with a GIS" - GIS does not compress. It decompresses. FWIW I'm struggling to understand your general point in the latter part of your answer. – bacar Oct 19 '13 at 17:52