8

Can somebody please explain the following mystery?

I created a binary file of about 37 MB. Zipping it in Ubuntu from the terminal took less than 1 second. I then tried Python: zipping it programmatically with the zipfile module also took about 1 second.

I then tried to unzip the zip file I had created. In Ubuntu, from the terminal, this took less than 1 second.

In Python, the code to unzip it (using the zipfile module) took close to 37 seconds to run! Any ideas why?

user3262424
  • 5
    Could you post the part where you are zipping the files? This way, we can make more accurate comments. – Utku Zihnioglu Feb 14 '11 at 22:19
  • I'm guessing the python zip/unzip code is interpreted instead of being a call out to some (compiled C) library. – Thomas M. DuBuisson Feb 14 '11 at 22:22
  • 1
    @TomMD: Actually, it isn't, since it depends on zlib, at least when the file is actually compressed. The actual decompression is done in native code. It might be worth comparing unzip times when the zip file is not compressed, to see if the effect is coming from interpretation. – Chinmay Kanchi Feb 14 '11 at 22:28
  • @chinmay The poster never said how he was calling 'zip' so I didn't want to assume anything. Good to know that the normal Python {,un}zip is a zlib binding though, thanks! – Thomas M. DuBuisson Feb 14 '11 at 23:27
  • 4
    Maybe you're not handling the stream of unzipped data efficiently. Loading a 37 MB-size string in memory will certainly take a long time due to memory allocation and swapping. You should send the output to a file directly. How are you using the `zipfile` module to unzip the compressed file? – scoffey Feb 15 '11 at 16:01
  • 1
    @scoffey: I find it hard to believe that memory allocation/swapping would take _that_ long. 37 MB is _nothing_, even in Python. – hammar Jun 06 '11 at 14:02
  • https://stackoverflow.com/questions/61930445/fast-zip-decryption-in-python, https://stackoverflow.com/questions/37141286/efficiently-read-one-file-from-a-zip-containing-a-lot-of-files-in-python – Albert Aug 24 '23 at 20:38

4 Answers

2

I was also struggling to unzip zip files with Python, and the low-level approach of "create a ZipFile object, loop through its .namelist(), read the files and write them to the file system" didn't seem very Pythonic. So I started digging into ZipFile objects, which I believe are not very well documented, and listed all of the object's methods:

>>> from zipfile import ZipFile
>>> filepath = '/srv/pydocfiles/packages/ebook.zip'
>>> zip = ZipFile(filepath)
>>> dir(zip)
['NameToInfo', '_GetContents', '_RealGetContents', '__del__', '__doc__', '__enter__', '__exit__', '__init__', '__module__', '_allowZip64', '_didModify', '_extract_member', '_filePassed', '_writecheck', 'close', 'comment', 'compression', 'debug', 'extract', 'extractall', 'filelist', 'filename', 'fp', 'getinfo', 'infolist', 'mode', 'namelist', 'open', 'printdir', 'pwd', 'read', 'setpassword', 'start_dir', 'testzip', 'write', 'writestr'] 

There we go: the "extractall" method works just like tarfile's extractall! (It is available on Python 2.6 and 2.7, but NOT on 2.5.)

Then the performance concerns: the file ebook.zip is 84.6 MB (mostly PDF files) and the uncompressed folder is 103 MB; it was zipped by the default "Archive Utility" under Mac OS X 10.5. So I timed the extraction with Python's timeit module:

>>> from timeit import Timer
>>> t = Timer("filepath = '/srv/pydocfiles/packages/ebook.zip'; \
...         extract_to = '/tmp/pydocnet/build'; \
...         from zipfile import ZipFile; \
...         ZipFile(filepath).extractall(path=extract_to)")
>>> 
>>> t.timeit(1)
1.8670060634613037

This took less than 2 seconds on a heavily loaded machine where 90% of the memory was being used by other applications.
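For anyone on a current Python, the same measurement can be written as a small script. This is only a sketch under a few assumptions: it builds its own tiny throwaway archive instead of using ebook.zip, the helper name timed_extractall is made up for illustration, and the timing will differ from machine to machine.

```python
import os
import tempfile
import time
from zipfile import ZipFile, ZIP_DEFLATED


def timed_extractall(archive_path, extract_to):
    """Extract every member of archive_path and return the elapsed seconds."""
    start = time.perf_counter()
    with ZipFile(archive_path) as archive:
        archive.extractall(path=extract_to)
    return time.perf_counter() - start


# Build a small throwaway archive so the sketch runs anywhere (assumption:
# this stands in for a real archive such as ebook.zip).
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "sample.zip")
with ZipFile(archive_path, "w", compression=ZIP_DEFLATED) as archive:
    archive.writestr("docs/a.txt", "hello " * 1000)
    archive.writestr("docs/b.txt", "world " * 1000)

extract_to = os.path.join(workdir, "out")
elapsed = timed_extractall(archive_path, extract_to)
print("extracted in %.4f s" % elapsed)
```

Using ZipFile as a context manager closes the archive even if extraction fails, which the one-liner inside Timer does not do explicitly.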

Hope this helps someone.

kirpit
  • wow, zipfile objects documentation is just updated on docs.python.org a day after I gave this answer. perhaps it was some output issue or python is doing grreeat! – kirpit Nov 07 '11 at 18:05
  • Nice info! However if we need to access just some files, or process them somehow instead of just uncompressing them, this won't help much I'm afraid :( – MarioVilas Oct 10 '13 at 01:21
0

I don't know what code you use to unzip your file, but the following works for me: After creating a zip archive "test.zip" containing just one file "file1", the following Python script extracts "file1" from the archive:

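from zipfile import ZipFile, ZIP_DEFLATED

# compression= is the method used when writing; it plays no role in mode 'r'.
zip = ZipFile("test.zip", mode='r', compression=ZIP_DEFLATED, allowZip64=False)
data = zip.read("file1")  # decompress the whole member into memory
print(len(data))  # parenthesized so it runs on both Python 2 and 3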

This takes nearly no time: I tried a 37MB input file which compressed down to a 15MB zip archive. In this example the Python script took 0.346 seconds on my MacBook Pro. Maybe in your case the 37 seconds were taken up by something you did with the data instead?
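If holding the whole member in memory is a concern, it can also be copied to disk in chunks instead of going through read(). The following is a minimal sketch, not a benchmark: it creates its own throwaway "test.zip" with a single member "file1", and the helper name stream_member_to_file and the 1 MiB chunk size are just illustrative choices.

```python
import os
import shutil
import tempfile
from zipfile import ZipFile, ZIP_DEFLATED


def stream_member_to_file(archive_path, member, dest_path, chunk_size=1024 * 1024):
    """Copy one archive member to dest_path in chunks; return the bytes written."""
    with ZipFile(archive_path) as archive:
        # ZipFile.open returns a file-like object that decompresses lazily.
        with archive.open(member) as src, open(dest_path, "wb") as dst:
            shutil.copyfileobj(src, dst, chunk_size)
    return os.path.getsize(dest_path)


# Throwaway input so the sketch is self-contained (2 MiB of repeated bytes).
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "test.zip")
payload = os.urandom(1024) * 2048
with ZipFile(archive_path, "w", compression=ZIP_DEFLATED) as archive:
    archive.writestr("file1", payload)

dest_path = os.path.join(workdir, "file1.out")
written = stream_member_to_file(archive_path, "file1", dest_path)
print(written)
```

Whether this is faster than read() depends on the archive and the machine; the point is only that the uncompressed data never has to exist as one large string.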

jochen
  • 3
    Reading just one file is easy - however a large zip archive with many small compressed files in it runs excruciatingly slow for me. Perhaps the file lookup within the zip is inefficient? – MarioVilas Oct 10 '13 at 01:20
0

Instead of using the Python module, we can call an archiver that is installed on Ubuntu (7z in this example) from Python. I use this because the Python zip module sometimes fails.

import os

filename = 'test'  # name of the file to archive
os.system('7z a %s.zip %s'% (filename, filename))
Rakesh
  • You should use `str.format()` instead of the % formatting, like `os.system('7z a {0}.zip {0}'.format(filename))`. As they mention in the [docs](http://docs.python.org/tutorial/inputoutput.html#old-string-formatting), it's going to be removed in the future and I believe it's already gone in 3+. – thegrinner Jun 06 '11 at 15:46
  • 6
    @thegrinner Wrong. This approach should be avoided at all, and instead `import subprocess; subprocess.call(['7z', 'a', filename+'.zip', filename])` be used. Or what happens if filename contains a space or a newline? – glglgl Nov 06 '11 at 16:14
0

Some options:

  • Use subprocess to delegate the work to an external tool. You can pipe data directly to it.
  • czipfile, although it does not seem to be maintained anymore (last release 2010). A somewhat recent fork is ziyuang/czipfile (last update 2019).
  • PyTorch has the internal native torch._C.PyTorchFileReader, which can read zip files; see the torch.load logic and _open_zipfile_reader. It does not currently support arbitrary zip files, but I think it would need only minor adaptations to do so.
  • libzip.py (2023) is a ctypes wrapper around libzip, but it seems to be little known.
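As a rough illustration of the first option, here is one way to read a single member through the Info-ZIP unzip command-line tool, whose -p flag writes the member to stdout. This is a sketch under assumptions: unzip must be installed, the helper names unzip_command and read_member_via_unzip are hypothetical, and unzip treats the member name as a pattern, so names containing wildcard characters would need extra care.

```python
import shutil
import subprocess


def unzip_command(archive_path, member):
    """Build the argv list for `unzip -p`, which writes a member to stdout."""
    return ["unzip", "-p", archive_path, member]


def read_member_via_unzip(archive_path, member):
    """Return the member's bytes by piping them out of the external unzip tool."""
    if shutil.which("unzip") is None:
        raise RuntimeError("the unzip command-line tool is not installed")
    result = subprocess.run(
        unzip_command(archive_path, member),
        stdout=subprocess.PIPE,  # capture the decompressed bytes
        check=True,              # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


# Only the command is printed here, so the sketch runs without unzip installed.
print(unzip_command("test.zip", "file1"))
```

Passing the arguments as a list rather than a single shell string means file names with spaces or newlines are handed to the tool unchanged.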
Albert