
I'm using:

java.util.zip

I have a while loop that reads from the stream into a buffer until it is exhausted. I'm reading two or more files from a folder, but I want something faster, so I want to use threads. If I use one thread per file, then when I unzip a 1 GB file alongside smaller files I'm not going to see much of a difference, because the large file still dominates the total time.

How can I split that job across threads? I can't read the same stream from different threads (can I?).
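For reference, the loop is roughly the following (a minimal sketch; the archive name, output folder, and buffer size are just placeholders):

```java
import java.io.*;
import java.util.zip.*;

public class UnzipSingleThreaded {
    public static void main(String[] args) throws IOException {
        // "archive.zip" and "output" are placeholder paths for this sketch
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream("archive.zip"))) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                File out = new File("output", entry.getName());
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                out.getParentFile().mkdirs();
                try (FileOutputStream fos = new FileOutputStream(out)) {
                    int read;
                    // drain the current entry until read() signals end of data
                    while ((read = zis.read(buffer)) != -1) {
                        fos.write(buffer, 0, read);
                    }
                }
            }
        }
    }
}
```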

GabrielBB
  • No, you can't. What makes you think it would be faster? And what does `ZipOutputStream` have to do with it? The best performance improvement you can get would be to put a `BufferedInputStream` both above and below the `ZipInputStream`. – user207421 Feb 18 '14 at 04:02
  • @EJP: I don’t think that `BufferedInputStream`s improve the performance here. They just add additional data copying to a process which handles the data in chunks already. – Holger Feb 18 '14 at 10:13

2 Answers


The average computer today handles zip decompression much faster than a hard disk can deliver the data. This applies to most SSDs as well, since the bus is the limiting factor.

So any attempt to speed up the process by raising CPU utilization will fail. The best thing you can do is to separate reading and writing, which might add a gain if the source and target are on different devices.

Or you can make the processing after the decompression multi-threaded. But if you are just reading and discarding the data, there is no way to accelerate the process significantly.
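For illustration, here is a minimal sketch of such a reader/writer split, assuming a bounded queue between one thread that decompresses and one thread that writes; the class name, paths, buffer and queue sizes are made up for this example:

```java
import java.io.*;
import java.util.Arrays;
import java.util.concurrent.*;
import java.util.zip.*;

public class PipelinedUnzip {

    /** One decompressed block of data for a target file (illustrative helper). */
    static final class Chunk {
        final File target;        // null marks the end of all work (poison pill)
        final byte[] data;
        final boolean endOfFile;  // true on the final chunk of an entry
        Chunk(File target, byte[] data, boolean endOfFile) {
            this.target = target; this.data = data; this.endOfFile = endOfFile;
        }
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<Chunk> queue = new ArrayBlockingQueue<>(64);

        // Writer thread: drains the queue and writes to the target device.
        Thread writer = new Thread(() -> {
            OutputStream out = null;
            try {
                while (true) {
                    Chunk c = queue.take();
                    if (c.target == null) break;                    // poison pill
                    if (out == null) out = new FileOutputStream(c.target);
                    out.write(c.data);
                    if (c.endOfFile) { out.close(); out = null; }
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });
        writer.start();

        // Reader thread (main): decompresses and hands chunks to the writer.
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream("archive.zip"))) {
            byte[] buffer = new byte[64 * 1024];
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                File target = new File("output", entry.getName());
                target.getParentFile().mkdirs();
                int n;
                while ((n = zis.read(buffer)) != -1) {
                    queue.put(new Chunk(target, Arrays.copyOf(buffer, n), false));
                }
                queue.put(new Chunk(target, new byte[0], true));    // close marker
            }
        }
        queue.put(new Chunk(null, new byte[0], false));             // stop the writer
        writer.join();
    }
}
```

The bounded queue provides back pressure, so the reader stalls instead of filling memory whenever the writer's device is the bottleneck.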

Holger
  • The answers here seem to suggest otherwise: http://stackoverflow.com/questions/20717897/multithreaded-unzipping-in-java – laxxy Apr 02 '14 at 16:16
  • @laxxy: go ahead try it yourself. If you can reproduce the results *then* you have a reason to downvote. But it is absurd to downvote my answer because of a user with a rep of 79 who *claimed* to have implemented such a thing with a different result, but did not show a single line of code or any other possibility to verify his results. – Holger Apr 03 '14 at 08:24
  • The thing is, my experience so far has been in line with his answer. For example, if the speed is mostly determined by I/O, why does a multithreaded lbzip2 (which implements, I believe, a more CPU-intensive algorithm) typically take less than half the time of zip/unzip in either direction? Or why does a copy operation between the same file systems take much less than that? Also, we do not know how fast the disk of the person who originally answered the question is: what if it's a RAM drive? I'll do some more testing though. – laxxy Apr 03 '14 at 18:35
  • @laxxy: A copy operation within the same file system can be performed without copying the file system’s buffer contents to the application’s address space. That obviously can be faster but that has nothing to do with multi-threading. But he claims that reading the entire contents into a byte buffer and opening the *single threaded* `ZipInputStream` on it using ByteArrayInputStream afterwards is faster than opening the same `ZipInputStream` impl on the file directly. If these numbers are right it would indicate a serious flaw in the `ZipInputStream` which has nothing to do with multi-threading. – Holger Apr 03 '14 at 19:01
  • I've never actually used ZipInputStream, and was referring to his 4th case with multiple threads. My objection is to the "The average today’s computer handles zip decompression much faster than a harddisk can provide the data" statement. Right now I ran a small experiment: unzipping a single archive (5GB, 21 files) with a single "unzip" (v.6.00, from a linux command line) takes 3'08" first time, 2'59" thereafter (likely due to caching). Unzipping the same file with 21 separate "unzip" commands started in parallel, each requesting one file, took 0'35", this is almost 6x faster. – laxxy Apr 03 '14 at 20:55
  • @laxxy: It’s very likely that the `unzip` command uses the very same zip decompression library as `ZipInputStream` uses. Maybe we simply lack a good definition of “average today’s computer” as he claimed initially that decompressing a 50MB file took 5s on his computer whereas the other answer suggests <200ms for the same operation. I’m still wondering how reading into a byte array and decompressing *afterwards* can take the same time as reading into a byte array without decompression. Unless he found a magic way of speeding up the reading into the byte array, that’s just impossible. – Holger Apr 04 '14 at 08:07

If you want multiple threads to read from the same zip file in a scalable way, you must open one ZipFile instance per thread. That way, the internal lock in the ZipFile methods (there is one per instance) does not limit reading from the zip file to one thread at a time. It also means that when each thread closes its ZipFile after it is done reading, it closes its own instance rather than a shared one, so you don't get an exception on the second and subsequent close.

Protip: if you really care about speed, you can get more performance by reading all the ZipEntry objects from the first ZipFile instance and sharing them with all threads, so that the work of reading the entry list is not duplicated in every thread. A ZipEntry object is not tied to a specific ZipFile instance per se; it just records metadata that will work with any ZipFile object representing the same zip file the ZipEntry came from.
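A minimal sketch of that approach, assuming the entries are handed to a fixed pool of workers through a concurrent queue (the archive path, output folder, and pool size are illustrative):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.zip.*;

public class ParallelZipFileRead {
    public static void main(String[] args) throws Exception {
        File zip = new File("archive.zip");              // illustrative path
        int threads = Runtime.getRuntime().availableProcessors();

        // Read the entry list once and share it with all workers.
        List<ZipEntry> entries = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip)) {
            for (Enumeration<? extends ZipEntry> en = zf.entries(); en.hasMoreElements(); ) {
                ZipEntry entry = en.nextElement();
                if (!entry.isDirectory()) entries.add(entry);
            }
        }

        Queue<ZipEntry> work = new ConcurrentLinkedQueue<>(entries);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                // Each worker opens (and later closes) its own ZipFile instance,
                // so the per-instance lock does not serialize the threads.
                try (ZipFile zf = new ZipFile(zip)) {
                    ZipEntry entry;
                    while ((entry = work.poll()) != null) {
                        Path out = Paths.get("output", entry.getName());
                        Files.createDirectories(out.getParent());
                        try (InputStream in = zf.getInputStream(entry)) {
                            Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

Each worker resolves a shared ZipEntry against its own ZipFile via getInputStream, so the entry list is read only once while the decompression itself runs in parallel.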

Luke Hutchison