4

I'm using Apache Commons Compress for Java to compress multiple log files into a single tar.bz2 archive.

However, compression takes a very long time (> 12 hours), because I compress around 20 GB of files a day.

As this library compresses files on a single thread, I'd like to know if there is a way to do this multi-threaded.

I found many solutions (command-line pbzip2 or some C++ libraries), but all I found for Java is this blog post:

https://plus.google.com/117421466255362255970/posts/3jfKVu325zh

It seems that I can't use it in my Java application.

Is there anything out there? What would you recommend? Or is there another, faster solution with compression ratios similar to bzip2?

Charles
Stefan

3 Answers

2

Since you have multiple files, you can compress each one in a separate thread. Because the work is CPU-bound, I suggest creating a fixed-size thread pool (an ExecutorService) and submitting one task per file to compress.
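A minimal sketch of the thread-pool approach described above. The class and method names (`ParallelFileCompressor`, `compress`, `compressAll`) are made up for illustration, and `GZIPOutputStream` from the JDK stands in for a bzip2 stream only so the example runs without extra dependencies; with Commons Compress on the classpath, `BZip2CompressorOutputStream` could presumably be swapped in at the marked line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPOutputStream;

public class ParallelFileCompressor {

    // Compresses one file to "<name>.gz" next to it and returns the new path.
    static Path compress(Path source) throws IOException {
        Path target = source.resolveSibling(source.getFileName() + ".gz");
        try (InputStream in = Files.newInputStream(source);
             // Assumption: GZIP is a stand-in; replace with a bzip2 stream if available.
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            in.transferTo(out); // requires Java 9+
        }
        return target;
    }

    // Submits one task per file to a fixed-size pool and waits for all of them.
    static List<Path> compressAll(List<Path> sources)
            throws InterruptedException, ExecutionException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> futures = new ArrayList<>();
            for (Path source : sources) {
                futures.add(pool.submit(() -> compress(source)));
            }
            List<Path> results = new ArrayList<>();
            for (Future<Path> future : futures) {
                results.add(future.get()); // rethrows any compression failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that this yields one compressed file per input rather than a single archive, so it only fits if separate compressed files are acceptable.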

Note: if pbzip2 does what you want, I would call it from Java. You might find it is faster even with one thread, as the BZIP2 libraries I have seen for Java are implemented in pure Java (unlike JAR, ZIP and GZIP, which are backed by native code).

Peter Lawrey
  • I have multiple files, but it should result in one big tar.bz2 file - so it's just one file to be compressed – Stefan Dec 26 '12 at 21:15
  • Compressed files are serial, i.e. based on what has happened before. I don't know how the other libraries get around this. You could create a .bz2.tar file. – Peter Lawrey Dec 26 '12 at 21:18
  • @Peter Lawrey: Normally you do the *tar* first, so that compression spans more than one file and the compression ratio increases. – MrSmith42 Dec 26 '12 at 21:21
  • @MrSmith42 But if you do that you can't compress portions in parallel. Looking at the docs, it appears that pbzip2 creates a tar file of compressed files. – Peter Lawrey Dec 26 '12 at 21:24
  • @Peter Lawrey: That's right. Parallelism can only be used within the bzip2 algorithm itself. – MrSmith42 Dec 26 '12 at 21:34
1

If a parallel implementation of bzip2 in Java doesn't exist, you can resort to invoking pbzip2 from within your Java application.
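One way this could look, as a rough sketch: pipe `tar` into `pbzip2` using `ProcessBuilder.startPipeline` (Java 9+). The class and method names here are hypothetical, and the sketch assumes that `tar` and `pbzip2` are on the PATH and that the installed pbzip2 compresses stdin to stdout when given `-c`, with `-p#` setting the number of processors; check your version's man page before relying on it.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

public class Pbzip2Archiver {

    // tar writes an uncompressed archive of logDir's contents to stdout.
    static List<String> tarCommand(String logDir) {
        return List.of("tar", "-cf", "-", "-C", logDir, ".");
    }

    // Assumption: pbzip2 reads stdin, writes to stdout (-c), uses `threads` processors (-p#).
    static List<String> pbzip2Command(int threads) {
        return List.of("pbzip2", "-c", "-p" + threads);
    }

    // Runs "tar | pbzip2 > archive" and returns 0 if both processes succeeded,
    // otherwise the last non-zero exit code seen.
    static int archive(String logDir, File archive, int threads)
            throws IOException, InterruptedException {
        ProcessBuilder tar = new ProcessBuilder(tarCommand(logDir))
                .redirectError(ProcessBuilder.Redirect.INHERIT);
        ProcessBuilder pbzip2 = new ProcessBuilder(pbzip2Command(threads))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .redirectOutput(archive);
        List<Process> pipeline = ProcessBuilder.startPipeline(List.of(tar, pbzip2));
        int exit = 0;
        for (Process process : pipeline) {
            int code = process.waitFor();
            if (code != 0) {
                exit = code;
            }
        }
        return exit;
    }
}
```

A call such as `Pbzip2Archiver.archive("/var/log/myapp", new File("logs.tar.bz2"), 8)` (paths and thread count are placeholders) would then produce a single tar.bz2 while the compression itself runs outside the JVM.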

reprogrammer
0

Try the at4j implementation of BZip2OutputStream. According to the manual, it supports parallel compression. http://at4j.sourceforge.net/releases/current/pg/ch04.xhtml

af1n