
I'm using a GZIPInputStream in my program, and I know performance would improve if I could get Java to run my program in parallel.

In general, is there a command-line option for the standard VM to run on many cores? It's running on just one as it is.

Thanks!

Edit

I'm running plain ol' Java SE 6 update 17 on Windows XP.

Would explicitly putting the GZIPInputStream on a separate thread help? No! Do not put the GZIPInputStream on a separate thread! Do NOT multithread I/O!

Edit 2

I suppose I/O is the bottleneck, as I'm reading and writing to the same disk...

In general, though, is there a way to make GZIPInputStream faster? Or a replacement for GZIPInputStream that runs in parallel?

Edit 3

Code snippet I used:

GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(INPUT_FILENAME));
DataInputStream in = new DataInputStream(new BufferedInputStream(gzip));
Rudiger
  • What's your platform and VM version? – moritz Jan 01 '10 at 21:09
  • Wouldn't this happen automatically when using Threads? – Jay Jan 01 '10 at 21:12
  • Related article: http://www.codinghorror.com/blog/archives/001231.html – Graphics Noob Jan 01 '10 at 21:21
  • You must learn how to create multi-threaded programs where each thread does lots of small pieces of work from a common work load queue. – Thorbjørn Ravn Andersen Jan 01 '10 at 22:20
  • I am absolutely convinced it is perfectly safe to put a GZIPInputStream on a separate thread, and to use multiple threads to process multiple independent streams. The benefit depends on how independent the accesses are, but it is simply not true that you can only have one GZIPInputStream open, on one thread, at a time. Voting down. – Audrius Meškauskas Feb 09 '13 at 10:28

9 Answers


AFAIK the action of reading from this stream is single-threaded, so multiple CPUs won't help you if you're reading one file.

You could, however, have multiple threads, each unzipping a different file.

That being said, unzipping is not particularly computation-intensive these days; you're more likely to be blocked by the cost of IO (e.g., if you are reading two very large files from two different areas of the HD).

More generally (assuming this is a question from someone new to Java), Java doesn't do things in parallel for you. You have to use threads to tell it which units of work you want done and how to synchronize between them. Java (with the help of the OS) will generally use as many cores as are available to it, and will swap threads on the same core if there are more threads than cores (which is typically the case).
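
By way of illustration, a minimal sketch of the one-unzipper-per-file idea (file names, pool size, and buffer size are arbitrary; modern Java syntax for brevity):

import java.io.*;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;
import java.util.zip.GZIPInputStream;

public class ParallelGunzip {
    public static void main(String[] args) throws Exception {
        // Hypothetical file names; one decompression task per file.
        List<String> files = Arrays.asList("a.gz", "b.gz", "c.gz");
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (final String name : files) {
            pool.submit(() -> {
                try (InputStream in = new BufferedInputStream(
                             new GZIPInputStream(new FileInputStream(name)));
                     OutputStream out = new BufferedOutputStream(
                             new FileOutputStream(name.replace(".gz", "")))) {
                    byte[] buf = new byte[8192];
                    for (int n; (n = in.read(buf)) != -1; ) {
                        out.write(buf, 0, n);
                    }
                }
                return null; // Callable, so the checked IOException may propagate
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}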

Uri
  • +1 for noting the IO bottleneck. This is too often overlooked in such cases. – BalusC Jan 01 '10 at 21:15
  • No no no no no no no, do NOT multithread I/O! Operating systems already synchronize IO between multiple applications, and adding another layer of threads on top of the IO abstraction they provide, especially for reading, kills the entire computer if you start using more than one thread for it. – Esko Jan 01 '10 at 21:17
  • My own experience is that it doesn't kill the computer, you just don't see any real benefit from the MT. – Uri Jan 01 '10 at 21:33
  • Multithreaded IO usually leads to lots of HD seeks which, though they may not kill the computer, can definitely send it into a coma for a while. – Michael Borgwardt Jan 04 '10 at 16:29

PIGZ (Parallel Implementation of GZip) is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data: http://www.zlib.net/pigz/. It's not Java yet; any takers? Of course the world needs it in Java.

Sometimes compression or decompression is a big CPU consumer, even though it helps keep the I/O from being the bottleneck.

See also DataSeries (C++) from HP Labs. PIGZ only parallelizes the compression, while DataSeries breaks the output into large compressed blocks that are decompressible in parallel. It also has a number of other features.

George

Wrap your GZIP streams in Buffered streams; this should give you a significant performance increase.

OutputStream out = new BufferedOutputStream(
    new GZIPOutputStream(
        new FileOutputStream(myFile)
    )
);

And likewise for the input stream. Using the buffered input/output streams reduces the number of disk reads.
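
For the input side, that wrapping matches the snippet in the question; per the buffer-size discussion in the comments below, you can also just enlarge GZIPInputStream's own internal buffer instead:

// Buffered wrapper around the decompressor, as in the question:
DataInputStream in = new DataInputStream(
    new BufferedInputStream(
        new GZIPInputStream(new FileInputStream(myFile))));

// Alternative: drop the BufferedInputStream and enlarge
// GZIPInputStream's internal buffer instead (the default is 512 bytes):
DataInputStream in2 = new DataInputStream(
    new GZIPInputStream(new FileInputStream(myFile), 8192));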

Sam Barnum
  • Should I wrap my GZIP streams in Buffered streams, or my Buffered streams in GZIP streams? For example: new GZIPInputStream(new BufferedInputStream(...)) new GZIPOutputStream(new BufferedOutputStream(...)) vs. new BufferedInputStream(new GZIPInputStream(...)) new BufferedOutputStream(new GZIPOutputStream(...)) – Rudiger Jan 01 '10 at 21:36
  • I believe you should wrap your BufferedStream around your GZIP stream. This will make your I/O more independent of whatever blocking the unzipper is doing. – Carl Smotricz Jan 01 '10 at 21:48
  • I have always believed the GZIPOutputStream is already buffered – Chii Jan 01 '10 at 23:25
  • That it has a `flush()` method would hint that it's buffered, yes. Also, there's a constructor that lets you specify the buffer size. Enough evidence for me :) – Carl Smotricz Jan 02 '10 at 14:56
  • GZIPInputStream may be buffered already, but when I explicitly added a BufferedInputStream around it, it went several times faster. – Rudiger Jan 02 '10 at 21:01
  • @Rudiger, can you post a code snippet of how you're using the stream? Are you using an ObjectOutputStream? – Sam Barnum Jan 03 '10 at 00:07
  • Just a FileInputStream; code snippet is attached to the original question. – Rudiger Jan 04 '10 at 16:22
  • Maybe it's just the default buffer size of GZIPOutputStream (512) vs. the default buffer size of BufferedOutputStream (8192). I'd be curious if you get good results from removing the Buffered stream and just upping the buffer size to 8192. – Sam Barnum Jan 06 '10 at 00:39

I'm not seeing any answer addressing the other processing of your program.

If you're just unzipping a file, you'd be better off simply using the command-line gunzip tool; but likely there's some processing happening with the files you're pulling out of that stream.

If you're extracting something that comes in reasonably sized chunks, then your processing of those chunks should be happening in a separate thread from the unzipping.

You could manually start a Thread for each large String or other block of data; but since Java 5 you'd be better off with one of the fancy new classes in java.util.concurrent, such as a ThreadPoolExecutor.
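
As an illustration only (the chunking scheme and processChunk are made up), the shape is roughly:

import java.io.*;
import java.util.Arrays;
import java.util.concurrent.*;
import java.util.zip.GZIPInputStream;

public class ChunkedUnzip {
    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try (InputStream in = new BufferedInputStream(
                new GZIPInputStream(new FileInputStream(args[0])))) {
            byte[] buf = new byte[64 * 1024];
            for (int n; (n = in.read(buf)) != -1; ) {
                // Hand each decompressed chunk to the pool; this thread
                // goes straight back to unzipping. (A bounded queue would
                // cap memory use if the workers fall behind.)
                final byte[] chunk = Arrays.copyOf(buf, n);
                workers.submit(() -> processChunk(chunk));
            }
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }

    // Hypothetical stand-in for whatever per-chunk work the program does.
    static void processChunk(byte[] chunk) {
    }
}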


Update

It's not clear to me from the question and other comments whether you really ARE just extracting files using Java. If you really, really think you should try to compete with gunzip, then you can probably gain some performance by using large buffers; i.e. work with a buffer of, say, 10 MB (binary, not decimal: 10 × 1,048,576 bytes), fill it in a single gulp and write it to disk likewise. That will give your OS a chance to do some medium-scale planning for disk space, and you'll need fewer system-level calls too.
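
A sketch of that big-buffer loop, assuming gzipped input and the 10 MB figure from above:

import java.io.*;
import java.util.zip.GZIPInputStream;

public class BigBufferGunzip {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[10 * 1024 * 1024]; // 10 MB, binary
        try (InputStream in = new GZIPInputStream(
                     new FileInputStream(args[0]), 64 * 1024);
             OutputStream out = new FileOutputStream(args[1])) {
            int filled = 0;
            for (int n; (n = in.read(buf, filled, buf.length - filled)) != -1; ) {
                filled += n;
                if (filled == buf.length) { // buffer full: one big write
                    out.write(buf, 0, filled);
                    filled = 0;
                }
            }
            if (filled > 0) out.write(buf, 0, filled); // final partial buffer
        }
    }
}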

Carl Smotricz
  • I'm not just extracting files using Java, but I can see it was a little ambiguous in my question. – Rudiger Jan 01 '10 at 22:04

I think it is a mistake to assume that multithreading IO is always evil. You probably need to profile your particular case to be sure, because:

  • Recent operating systems use the currently free memory for the cache, and your files may actually not be on the hard drive when you are reading them.
  • Recent drives like SSDs have much faster access times, so changing the read location is much less of an issue.
  • The question is too general to assume we are reading from a single hard drive.

You may need to tune your read buffer to make it large enough to reduce the switching costs. In the extreme case, you can read all the files into memory and decompress them there in parallel: faster, and with no loss from multithreaded IO at all. However, something less extreme may also work well.

You also do not need to do anything special to use multiple available cores from the JRE. Different threads will normally run on different cores, as managed by the operating system.

Audrius Meškauskas

Compression seems like a hard case for parallelization because the bytes emitted by the compressor are a non-trivial function of the previous W bytes of input, where W is the window size. You can obviously break a file into pieces and create independent compression streams for each piece, each running in its own thread. You may need to retain some compression metadata so the decompressor knows how to put the file back together.
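
One way to sidestep the metadata problem, sketched below: the gzip format allows multiple complete members to be concatenated in one file, so each piece can be compressed independently and the results written back to back (roughly what pigz does, though pigz also primes each block with the previous 32 KB of input to preserve the compression ratio, which this simple version does not). Command-line gunzip handles concatenated members, but note that some older GZIPInputStream versions stop after the first member. Piece size and file handling here are arbitrary, and all compressed pieces are held in memory for simplicity.

import java.io.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;
import java.util.zip.GZIPOutputStream;

public class PigzStyleCompress {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<byte[]>> pieces = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            byte[] buf = new byte[1024 * 1024]; // 1 MB pieces (arbitrary)
            for (int n; (n = readFully(in, buf)) > 0; ) {
                final byte[] piece = Arrays.copyOf(buf, n);
                // Each piece becomes its own complete gzip member.
                pieces.add(pool.submit(() -> gzip(piece)));
            }
        }
        try (OutputStream out = new FileOutputStream(args[0] + ".gz")) {
            for (Future<byte[]> f : pieces) {
                out.write(f.get()); // concatenate the members in order
            }
        }
        pool.shutdown();
    }

    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Fill buf as far as possible; returns the number of bytes read (0 at EOF).
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        for (int n; off < buf.length && (n = in.read(buf, off, buf.length - off)) != -1; ) {
            off += n;
        }
        return off;
    }
}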

President James K. Polk

Compression and decompression using gzip is a serialized process. To use multiple threads you would have to write a custom program to break up the input file into many streams, and then a custom program to decompress and join them back together. Either way, IO is going to be a bottleneck WAY before CPU usage is.

  • Maybe someone can write a zipped input/output stream that follows the same API as InputStream and OutputStream, but for the multi-core era. However, I/O is the bottleneck. – Rudiger Jan 01 '10 at 21:51
  • Then it wouldn't be gzip anymore, it would be some custom format –  Jan 03 '10 at 03:25

Run multiple VMs. Each VM is a process, and you should be able to run at least three processes per core without suffering any drop in performance. Of course, your application would have to be able to leverage multiprocessing in order to benefit. There is no magic bullet, which is why you see articles in the press moaning about how we don't yet know how to use multicore machines.

However, there are lots of people out there who have structured their applications into a master which manages a pool of worker processes and parcels out work packages to them. Not all problems are amenable to being solved this way.

Michael Dillon

You can't parallelize the standard GZIPInputStream; it is single-threaded. But you can pipeline decompression and the processing of the decompressed stream into different threads, i.e. set up the GZIPInputStream as a producer and whatever processes it as a consumer, and connect them with a bounded blocking queue.
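
For example, a minimal sketch of such a pipeline (the chunk size, queue capacity, and the byte-counting consumer are placeholders); the bounded queue applies backpressure so the reader can't run arbitrarily far ahead of the consumer:

import java.io.*;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.zip.GZIPInputStream;

public class GunzipPipeline {
    private static final byte[] EOF = new byte[0]; // poison pill

    public static void main(final String[] args) throws Exception {
        final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16);

        // Producer: decompress and enqueue chunks.
        Thread producer = new Thread(() -> {
            try (InputStream in = new BufferedInputStream(
                    new GZIPInputStream(new FileInputStream(args[0])))) {
                byte[] buf = new byte[64 * 1024];
                for (int n; (n = in.read(buf)) != -1; ) {
                    queue.put(Arrays.copyOf(buf, n)); // blocks when the queue is full
                }
                queue.put(EOF);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        // Consumer: stand-in processing (here, just a byte count).
        long total = 0;
        for (byte[] chunk; (chunk = queue.take()) != EOF; ) {
            total += chunk.length;
        }
        producer.join();
        System.out.println("decompressed bytes: " + total);
    }
}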

Luke Hutchison