What Is The Best Python Zip Module To Handle Large Files?

Question

EDIT: Specifically compression and extraction speeds.

Any Suggestions?

Thanks

have you compared the performance of zipfile to that of using zip/unzip directly in the shell? — John La Rooy, Nov 19 '09 at 00:57
Related: https://stackoverflow.com/questions/4997910/python-unzip-tremendously-slow, https://stackoverflow.com/questions/61930445/fast-zip-decryption-in-python — Albert, Aug 24 '23 at 20:42

score 15 · Accepted Answer · answered Nov 19 '09 at 03:32

So I made a random-ish large zipfile:

$ ls -l *zip
-rw-r--r--  1 aleax  5000  115749854 Nov 18 19:16 large.zip
$ unzip -l large.zip | wc
   23396   93633 2254735

i.e., 116 MB with 23.4K files in it, and timed things:

$ time unzip -d /tmp large.zip >/dev/null

real    0m14.702s
user    0m2.586s
sys     0m5.408s

this is the system-supplied command-line unzip binary -- no doubt as finely-tuned and optimized as a pure C executable can be. Then (after cleaning up /tmp;-)...:

$ time py26 -c'from zipfile import ZipFile; z=ZipFile("large.zip"); z.extractall("/tmp")'

real    0m13.274s
user    0m5.059s
sys     0m5.166s

...and this is Python with its standard library - a bit more demanding of CPU time, but roughly 10% faster in real, that is, elapsed time (13.3s vs 14.7s).

You're welcome to repeat such measurements of course (on your specific platform -- if it's CPU-poor, e.g. a slow ARM chip, then Python's extra demands of CPU time may end up making it slower -- and your specific zipfiles of interest, since each large zipfile will have a very different mix of contents and quite possibly different performance). But what this suggests to me is that there isn't much room to build a Python extension much faster than good old zipfile -- since Python using it beats the pure-C, system-included unzip!-)
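
If you would rather repeat the comparison from inside Python than with the shell's time, a minimal sketch could look like this. The helper names (make_sample_zip, time_zipfile, time_unzip_cli) and the generated sample archive are illustrative assumptions, not what was used for the numbers above, and only wall-clock time is measured (no user/sys split).

```python
import shutil
import subprocess
import tempfile
import time
import zipfile


def make_sample_zip(zip_path, n_files=200, size=4096):
    """Build a throwaway archive (a hypothetical stand-in for large.zip)."""
    payload = b"x" * size
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(n_files):
            zf.writestr(f"dir{i % 10}/file{i}.txt", payload)


def time_zipfile(zip_path):
    """Wall-clock seconds for ZipFile.extractall into a fresh temp dir."""
    with tempfile.TemporaryDirectory() as dest:
        start = time.perf_counter()
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest)
        return time.perf_counter() - start


def time_unzip_cli(zip_path):
    """Same measurement for the command-line unzip; None if it is not installed."""
    if shutil.which("unzip") is None:
        return None
    with tempfile.TemporaryDirectory() as dest:
        start = time.perf_counter()
        subprocess.run(["unzip", "-q", str(zip_path), "-d", dest], check=True)
        return time.perf_counter() - start
```

Call time_zipfile("large.zip") and time_unzip_cli("large.zip") several times each on your own archive and compare; a single run on a tiny generated archive says very little, and disk caching can easily dominate the first run.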

It would be nice to see memory usage measurements too. +1 anyway. — Denis Otkidach, Nov 19 '09 at 09:56
Apparently, your mileage may vary... http://dmarkey.com/wordpress/2011/10/15/python-zipfile-speedup-tips/ — MarioVilas, Oct 10 '13 at 01:22

score 5 · Answer 2 · answered Nov 19 '09 at 13:47


For handling large files without loading them into memory, use the new stream-based methods in Python 2.6's version of zipfile, such as ZipFile.open. Don't use extract or extractall unless you have strongly sanitised the filenames in the ZIP.

(You used to have to read all the bytes into memory, or hack around it like zipstream; this is now obsolete.)
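
As a rough illustration of the stream-based approach, here is a minimal sketch that copies each member out in chunks via ZipFile.open and rejects member names that would land outside the destination directory. The function names safe_target and stream_extract, the chunk size, and this particular path check are my own assumptions, not something prescribed by zipfile; it is written for Python 3.6+ rather than 2.6.

```python
import os
import shutil
import zipfile


def safe_target(dest_dir, member_name):
    """Return the path for member_name inside dest_dir, or raise ValueError."""
    dest_root = os.path.realpath(dest_dir)
    target = os.path.realpath(os.path.join(dest_root, member_name))
    # Absolute names and ".." components resolve outside dest_root.
    if os.path.commonpath([dest_root, target]) != dest_root:
        raise ValueError(f"unsafe member name: {member_name!r}")
    return target


def stream_extract(zip_path, dest_dir, chunk_size=1024 * 1024):
    """Extract members one chunk at a time, never holding a whole file in memory."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            target = safe_target(dest_dir, info.filename)
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # zf.open returns a file-like object that decompresses lazily.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)
```

The path check here is only one possible sanitisation policy; it does not deal with things like file permissions or reserved names on Windows.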


bobince


I found opening compressed content on the fly with `ZipFile.open()` to be actually slightly _faster_ than opening the same number of files from the file system (i.e. extracted previously from a .zip archive). This is probably because `ZipFile.open()` uses the already open .zip and does not require the overhead of file system directory and file open operations. Disclaimer: I had to process many small files with a weak compression ratio. YMMV with bigger files or when there are not so many files in the archive. I used Python 3.5.3. – Adrian W, Jun 08 '18 at 13:26
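
To make the "process in place" idea concrete, here is a small sketch that reads every member through one open archive handle without extracting anything to disk. Hashing is just a placeholder workload, and the name hash_members and the chunk size are assumptions for illustration; it uses is_dir(), so it needs Python 3.6+.

```python
import hashlib
import zipfile


def hash_members(zip_path, chunk_size=64 * 1024):
    """Map each file member's name to the SHA-256 hex digest of its contents."""
    digests = {}
    # One open archive is reused for every member; nothing is written to disk.
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            h = hashlib.sha256()
            with zf.open(info) as member:
                # Read decompressed data in fixed-size chunks until exhausted.
                for chunk in iter(lambda: member.read(chunk_size), b""):
                    h.update(chunk)
            digests[info.filename] = h.hexdigest()
    return digests
```

Whether this beats reading pre-extracted files will depend on file count, file size and compression ratio, so time it on your own data.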
