GZIPOutputStream that does its compression in a separate thread

Question

Is there an implementation of GZIPOutputStream that does the heavy lifting (compressing + writing to disk) in a separate thread?

We are continuously writing huge amounts of GZIP-compressed data. I am looking for a drop-in replacement that could be used instead of GZIPOutputStream.

Ah, correct me if I'm wrong but can't you simply wrap the GZIPOutputStream with a thread yourself? — Femi, Sep 21 '12 at 14:03
See my new answer below (sorry that it is over 7 years after the question was asked!). — Luke Hutchison, Oct 27 '19 at 05:08

Peter Lawrey · Accepted Answer · 2012-09-21T14:26:24.100

score 5

You can write to a PipedOutputStream and have a thread which reads the PipedInputStream and copies it to any stream you like.

This is a generic implementation. You pass it the target OutputStream and it returns a new OutputStream; whatever you write to the returned stream is copied to the target by a background thread.

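import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// Returns a stream whose contents are copied to 'out' by a background thread.
public static OutputStream asyncOutputStream(final OutputStream out) throws IOException {
    PipedOutputStream pos = new PipedOutputStream();
    final PipedInputStream pis = new PipedInputStream(pos);
    new Thread(new Runnable() {
        @Override
        public void run() {
            try {
                byte[] bytes = new byte[8192];
                // read() returns -1 once the writing side has been closed
                for (int len; (len = pis.read(bytes)) != -1; )
                    out.write(bytes, 0, len);
            } catch (IOException ioe) {
                ioe.printStackTrace();
            } finally {
                // closing 'out' lets it finish, e.g. a GZIPOutputStream writes its trailer
                close(pis);
                close(out);
            }
        }
    }, "async-output-stream").start();
    return pos;
}

// Closes quietly, ignoring any IOException.
static void close(Closeable closeable) {
    if (closeable != null) try {
        closeable.close();
    } catch (IOException ignored) {
    }
}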
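
For the GZIP case in the question, the target handed to the worker thread can itself be a GZIPOutputStream, so compression and the write to the underlying stream both happen off the calling thread. The following is only a sketch under assumptions: `AsyncGzipExample` and `asyncGzip` are hypothetical names, and making `close()` wait for the worker thread is an addition that the code above does not have (there, `close()` returns before the worker has necessarily finished). It uses `readAllBytes()`, so it needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InterruptedIOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class AsyncGzipExample {

    /** Hypothetical helper: gzip-compresses into 'target' on a worker thread. */
    static OutputStream asyncGzip(OutputStream target) throws IOException {
        final GZIPOutputStream gzip = new GZIPOutputStream(target);
        PipedOutputStream pos = new PipedOutputStream();
        final PipedInputStream pis = new PipedInputStream(pos);
        final Thread worker = new Thread(() -> {
            byte[] buf = new byte[8192];
            // try-with-resources closes the gzip stream (writing its trailer), then the pipe
            try (PipedInputStream in = pis; OutputStream out = gzip) {
                for (int len; (len = in.read(buf)) != -1; ) {
                    out.write(buf, 0, len);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }, "async-gzip");
        worker.start();

        return new FilterOutputStream(pos) {
            @Override
            public void write(byte[] b, int off, int len) throws IOException {
                // Forward whole arrays; FilterOutputStream would write byte by byte.
                out.write(b, off, len);
            }

            @Override
            public void close() throws IOException {
                super.close(); // signals end-of-stream to the worker
                try {
                    worker.join(); // assumption: callers want close() to mean "all data written"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new InterruptedIOException("interrupted while waiting for the worker");
                }
            }
        };
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        byte[] payload = "hello, pipelined gzip".getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = asyncGzip(sink)) {
            out.write(payload);
        }
        // Round trip: decompress what the worker thread produced.
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(sink.toByteArray()))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

In real use the `ByteArrayOutputStream` would be replaced by a `FileOutputStream` or similar; it is used here only so the round trip can be checked in memory.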

edited Sep 21 '12 at 14:26

answered Sep 21 '12 at 14:10


That's much better than my answer :) – David Grant Sep 21 '12 at 14:12
Sounds promising. How would you attach the `PipedInputStream` to the `GZIPOutputStream` (both in the worker thread)? Is there an efficient stream copier for this purpose? – krlmlr Sep 21 '12 at 14:22
I have added a sample implementation. – Peter Lawrey Sep 21 '12 at 14:26
Thank you! It would have taken me ages to derive this. I'm going to enhance this with a thread pool and post here. -- Two more questions: 1. I understand that I can tweak the buffer size for performance, right? 2. This reference http://thomaswabner.wordpress.com/2007/10/09/fast-stream-copy-using-javanio-channels/ presents fast copying using NIO. Do you think it will help using this instead of the "trivial" approach, in terms of performance? – krlmlr Sep 21 '12 at 14:43
One more thing: It appears to me that the `PipedInputStream` is created in the same thread as the `PipedOutputStream`. Shouldn't it be created in the worker thread instead? – krlmlr Sep 21 '12 at 14:50
Given the buffer size for GZIP is 512 bytes and it does most of the real work I suspect 8 KB is overkill as it is. I have tried to use NIO with GZIP before but not found it to be any faster (again because most of the delay is the compression) – Peter Lawrey Sep 21 '12 at 14:51
Given the PipedInputStream is created before the thread is started it shouldn't make any difference. – Peter Lawrey Sep 21 '12 at 14:55
I'm testing the above implementation and I'm not seeing the threads utilize two CPUs fully. It seems that the nature of the blocking IO in PipedOutputStream is causing the main thread to wait until the async-output-thread has finished writing its data, therefore causing a bottleneck, and performance is not any better than if the work had been done on a single thread. – cstroe Sep 09 '16 at 16:17
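
One possible contributor to the blocking described in the last comment: a PipedInputStream created with the single-argument constructor has a 1024-byte internal buffer, so the writing thread has to wait whenever the worker falls more than 1 KB behind. The two-argument constructor `PipedInputStream(PipedOutputStream, int)` accepts a larger pipe size. The sketch below is a variant of the accepted answer's method with that parameter exposed; the class name `BufferedAsyncOutput` is hypothetical, and whether a bigger pipe actually helps for a given workload is an assumption that would need measuring.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

final class BufferedAsyncOutput {

    /**
     * Like asyncOutputStream(out), but with a caller-chosen pipe size in bytes
     * so the writer can run further ahead of the worker before it blocks.
     */
    static OutputStream asyncOutputStream(final OutputStream out, int pipeSize)
            throws IOException {
        PipedOutputStream pos = new PipedOutputStream();
        final PipedInputStream pis = new PipedInputStream(pos, pipeSize);
        new Thread(() -> {
            byte[] bytes = new byte[8192];
            // Closes 'target' first, then the pipe, when the loop ends or fails.
            try (PipedInputStream in = pis; OutputStream target = out) {
                for (int len; (len = in.read(bytes)) != -1; ) {
                    target.write(bytes, 0, len);
                }
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }, "async-output-stream").start();
        return pos;
    }
}
```

A call such as `BufferedAsyncOutput.asyncOutputStream(new GZIPOutputStream(fileOut), 1 << 20)` would give the writer 1 MiB of slack; the figure is arbitrary and `fileOut` is a placeholder for whatever stream is being written to.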

score 1 · Answer 2 · answered Oct 27 '19 at 05:03

I published some code that does exactly what you are looking for. It has always frustrated me that Java doesn't automatically pipeline calls like this across multiple threads, in order to overlap computation, compression, and disk I/O:

https://github.com/lukehutch/PipelinedOutputStream

This class splits writing to an OutputStream into separate producer and consumer threads (actually, starts a new thread for the consumer), and inserts a blocking bounded buffer between them. There is some data copying between buffers, but this is done as efficiently as possible.

You can even layer this twice to do the disk writing in a separate thread from the gzip compression, as shown in README.md.
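
To illustrate the general idea of a bounded buffer between a producer and a consumer thread, here is a minimal sketch built on `java.util.concurrent.ArrayBlockingQueue`. It is not the code or the API of the linked library; the class name `QueueingOutputStream` and the `maxPendingChunks` parameter are hypothetical, and it copies every chunk once, which a tuned implementation might avoid.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical illustration only, not the linked library's class. */
final class QueueingOutputStream extends OutputStream {
    private static final byte[] EOF = new byte[0]; // sentinel, compared by identity
    private final BlockingQueue<byte[]> queue;
    private final Thread consumer;
    private volatile IOException failure;          // first error seen by the consumer
    private boolean closed;                        // touched by the producer thread only

    QueueingOutputStream(final OutputStream target, int maxPendingChunks) {
        queue = new ArrayBlockingQueue<>(maxPendingChunks);
        consumer = new Thread(() -> {
            try (OutputStream out = target) {
                for (byte[] chunk; (chunk = queue.take()) != EOF; ) {
                    if (failure != null) continue; // keep draining so the producer never hangs
                    try {
                        out.write(chunk);
                    } catch (IOException e) {
                        failure = e;
                    }
                }
            } catch (IOException e) {              // thrown while closing the target
                if (failure == null) failure = e;
            } catch (InterruptedException e) {
                failure = new InterruptedIOException("consumer thread interrupted");
            }
        }, "queueing-output-stream");
        consumer.start();
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (closed) throw new IOException("stream closed");
        if (failure != null) throw failure;
        if (len == 0) return;
        // Copy, because the caller may reuse its array; put() blocks while the queue is full.
        put(Arrays.copyOfRange(b, off, off + len));
    }

    @Override
    public void close() throws IOException {
        if (closed) return;
        closed = true;
        put(EOF);
        try {
            consumer.join(); // wait until everything has reached the target
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new InterruptedIOException("interrupted while waiting for the consumer");
        }
        if (failure != null) throw failure;
    }

    private void put(byte[] chunk) throws IOException {
        try {
            queue.put(chunk);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new InterruptedIOException("interrupted while queueing data");
        }
    }
}
```

With this sketch, layering twice would look like `new QueueingOutputStream(new GZIPOutputStream(new QueueingOutputStream(fileOut, 16)), 16)`, where `fileOut` is a placeholder for the file stream: the outer consumer thread compresses while the inner one writes to disk.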
